#NoNullPointers - Side-effects of Elm in production

We’ve had a major part of our frontend code running in Elm for about nine months now. Without going in depth into technical implementation details, I would like to share my experience of the project. I’ll cover what led up to the project, the project itself, and what we have learned as it has evolved in production.

The back story

At Bellroy, we have a mission.

Inspire better ways to carry. Use business as a force for good. Help the world, and our crew, flourish.

We have many customers who become avid fans of not only our products but also align with the values in our mission.

Our customer support team is excellent, not just in their customer-facing work but also in communicating customers’ concerns to internal teams so that we can improve the customer experience.

They provide a weekly broadcast sharing some of the experiences of our customers. One of the pieces of feedback was about the checkout process on bellroy.com.

About 4 years ago we had decided to move away from an off-the-shelf e-commerce platform and go with an internal Rails application. Some of the reasons included:

We wanted more control over content than the e-commerce platform at that time was able to provide.
Creating and migrating product configuration data between our staging and production environments was problematic, error prone and took a very long time.
We had multilingual support, but the implementation at the time fell short of our desired translations workflow.

So, with moving to a Rails application, we needed to have a checkout. We built the initial version in React. A snapshot of what it was like to work in React a few years ago went something like this:

Integration with React in Rails was done with react-rails gem
No centralised application state management (no Redux)
Multiple choices of JS testing library, each with limitations at the time that could only be covered by borrowing a capability from one of the others
Difficulties in setting up and maintaining Javascript dependencies

After getting it live in 2015, it worked and worked well. Adding small improvements and tiny bits of functionality was trivial and we continued for another year getting the benefits of using a client side framework.

Let’s redesign!

We had complaints from customers about the checkout flow. There were a few steps that needed to change and it wasn’t something that we could support in our current model without significant architectural code changes.

The design hadn’t really altered since the original launch, which meant that it didn’t evolve with the rest of the site. Other parts of the site had undergone significant changes.

There was more that we could do to cater for different devices. We wanted to modify the checkout flow to provide a better experience on the devices and resolutions that had made their way into the market over the past 2-3 years.

Cutting a long story short - approximately five people across two teams spent around 9 months redesigning, rebuilding (using Redux, Webpack and a bunch of other tech that had gained popularity) and launching.

The initial launch was problematic. Here’s a short list of some things that went wrong.

The initial estimates for the rebuild were approx 3 months. We exceeded that time frame by a factor of two.
A bug in the code resulted in us accidentally throwing away some customer contact data over a one-week period, which then had to be chased up by the customer support team.
We couldn’t replicate many of the Javascript errors locally which made root cause identification difficult.
Our implementation of Redux meant that we did not consider some possible state transitions in the checkout flow, which resulted in poor checkout experiences for some visitors.
Overall relative conversion rate dropped ~8% and we couldn’t pinpoint the exact cause.

From a project point of view, there were a number of things that fell short. We had invested so much time into a new design that it was difficult to admit it didn’t give us a return on the investment.

From a software perspective we also identified some things that we did poorly.

We did a full rewrite of the application with a hard cut-over. In retrospect, we should have looked for a more sensible refactoring path.
The tests that we had for the old checkout were at the level of abstraction of that code. We did have some high level tests for the checkout that tested the happy path but we could have done more to ensure that we correctly brought across all other code paths, specifically the ones that dealt with validation and error messaging for users.
Because of our team structure, responsibility for the rewrite fell mostly to one or two people. This made it difficult for other reviewers to get involved until very late in the process, when certain architectural decisions had already been committed to.

We all took a step back and rationally worked through some of the reasons for such a poor outcome. Communication failures? Waterfall approach? Poor tooling? Lack of discipline? Failing processes? These reasons and more were all present in our analysis to some degree.

We tried to salvage the project, jumping in to solve the issues and launching a few more times with equally poor (or worse) results. We ended up stopping work on the project and rolling back to our original React version of the checkout.

Experimentation

As a technology team we spend a portion of our time working on in-house tools to scratch our own itches. Aside from getting the productivity benefits of the tools we build, it is also a great way to test and evaluate new technologies safely in a production setting where we could learn from any problems that might arise.

One of these tools is called Morty. It started out as a time tracking tool to help us calibrate our internal models of task/project estimation. It is integrated with Pivotal Tracker which is what we use to keep track of stories and velocity.

As our primary technologies had been Ruby and Javascript for a long time, many team members had started looking to alternatives in the strongly typed functional space.

There had been interest from a few of us in Haskell previously and we had written some CLI tools with it for internal purposes. This led to the discovery of Elm for the front end.

We also run a weekly book club as a technology team. Through this, we started working through the Elm Programming book and exercises, which were fundamental in getting everyone comfortable with the language. We were encouraged by the language and how great it felt to work with, so we started building small components in Elm to add/replace functionality in Morty.

Our manager, who is very hands on with writing software, took a week to implement a digital Kanban board in Morty using Elm. Afterwards he did a lightning talk of his experience and the overall solution.

We were all very impressed. The insights as to how it helped with the development process as well as the solution were clear. Elm was at version 0.17 at the time.

A second chance

Bellroy was growing, our customers were still giving us great reviews and we had brought on board a number of key staff to help us develop better customer experiences on bellroy.com.

Talk of updating the checkout flow came up again in conversations. We had been looking at conversions, competitor checkouts and customer feedback over the past 6 months and started to see that our checkout experience was starting to let our customers down.

The business wanted to proceed. We made a bold decision: to write a direct port of the current checkout in Elm, ship it as an A/B test and evaluate.

It is worth addressing the earlier point that full rewrites are generally problematic. Part of the reason the earlier rewrite was error-prone was that we had not catered for all application states. Doing this in Elm meant that the compiler guided us through the design and implementation of those states. It was a calculated risk based on our experience building the components in Morty.

It took us three months to complete the new checkout. This is from never having programmed in a strongly typed functional language to having the Elm checkout running in production.

There were a few teething issues but overall it was a great success. We weren’t able to detect a measurable difference between the existing React checkout and the Elm checkout. Production errors originating in the Elm variant were nonexistent. We soon switched over to the Elm version permanently.

The main driver

Why did the Elm checkout rewrite succeed?

The built in tooling was simple and easy to use.
Compiler error messages were easy to interpret and resolve.
The Elm guide was very helpful in getting the team up to speed.
Including Elm in our Rails application was relatively simple.
Having side-effects handled by the Elm runtime meant we had more cognitive capacity to reason about the checkout behaviour and flow.
Code reviews were simpler. Knowing that the compiler would catch many more errors, we were able to focus on the business logic of the application.
As we iterated, we had confidence that there were no runtime errors in the application.
Modeling the states of the checkout process proved very simple. Elm directed us to cases that needed to be handled throughout the application as we added more states.
Reasoning about behavioural errors was straightforward. The errors usually occurred at the boundaries of our application where we were making incorrect assumptions about incoming/outgoing data.
We found that there was less code to maintain. 12,676 lines for the React checkout. 7,246 lines for the Elm checkout, which is roughly 43% fewer. That’s always a nice win!
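
As a rough illustration of what modelling the checkout states buys you, here is a minimal sketch. It is written in TypeScript as a stand-in, and the state names and fields are hypothetical rather than taken from the real checkout. In Elm, a custom type and a `case` expression give the same guarantee directly, because the compiler rejects a `case` that misses a variant.

```typescript
// Hypothetical checkout states; names and fields are illustrative only.
type CheckoutState =
  | { kind: "EnteringDetails" }
  | { kind: "ChoosingShipping"; email: string }
  | { kind: "Paying"; email: string; shippingMethod: string }
  | { kind: "Complete"; orderId: string };

// Only type-checks when every variant has already been handled,
// because nothing except `never` can be passed in.
function assertNever(value: never): never {
  throw new Error(`Unhandled state: ${JSON.stringify(value)}`);
}

function stepTitle(state: CheckoutState): string {
  switch (state.kind) {
    case "EnteringDetails":
      return "Your details";
    case "ChoosingShipping":
      return "Shipping";
    case "Paying":
      return "Payment";
    case "Complete":
      return `Order ${state.orderId} confirmed`;
    default:
      // Adding a new variant to CheckoutState turns this line into a
      // compile error until a matching case is added above.
      return assertNever(state);
  }
}
```

Adding a fifth state to the union, say an express-payment step, would fail to compile at `assertNever(state)` until `stepTitle` handles it. That is the kind of feedback the list above describes.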

I’m not sure if it is possible to discern a common theme but I kept coming back to one thought in particular.

We were getting feedback on avoidable errors as we were developing. Essentially for free, after the initial learning curve of the language. Yes, we had to learn a new language, but all in all, it was about as difficult as learning React. If those costs are similar, we were getting a lot more value from Elm out of the box.

Acceleration, finally!

An important requirement for a business within a growth phase is to keep momentum on key growth areas. We had been A/B testing new site layouts, designs, copy, content, etc throughout bellroy.com with much success.

Side note: It’s important to point out here that success didn’t always equate to financial improvement. For example, we wanted to ensure that customer engagement remained high. Or perhaps we changed the chat widget to see which ones were more inviting for visitors to ask questions. Certainly for branding exercises when you want to communicate something specific, actually taking a hit financially is something that you need to weigh up as a business.

So with the testing of other aspects of the site going well, we were now able to apply those principles to the checkout. Within a 6 month period we A/B tested:

a new responsive layout
reordering of key checkout steps
a 2-page checkout against a 1-page checkout
the inclusion of a number of express checkout methods (PayPal Express, Apple Pay, etc)
an upgrade to Elm 0.19 for a performance improvement

Although not all of these experiments improved conversion rates, our developers were no longer a bottleneck for the UX and design teams to try new things. We were able to enhance and fearlessly refactor the checkout to adapt to all of our integration tests and continue to build a great experience for our customers.

For our development team, it meant that our one frontend specialist was able to spend less time maintaining code for features that were already shipped. Their focus was directed to other growth projects and allowed for more opportunities to be explored.

This project improved the bond and communication between our team and the other teams because we were able to iterate, evolve and move with greater speed. The communication feedback loop was tightened, leading to a much more efficient and enjoyable way of working as a larger team.

Writing this small but critical part of our application in Elm has paid for itself many times over and will continue paying for itself because it has minimised the maintenance burden over time for the entire lifetime of the application.

Downsides?

Although we hadn’t experienced a single runtime error, there were things that were not ideal.

Where are the escape hatches?

The way to interact with the outside world is via Elm ports. There are limitations to this approach though. One we encountered was in relation to .focus() on iOS devices. Here’s a comment verbatim from the code.

* In order for .focus() to work on iOS devices the focus action must be handled directly in the event callback of the click event.
* an Elm.embed (Elm.application/document would work) architecture prevents this from working since the focus function can only be handled through a port.
* And that port Cmd can only be called after the Msg has gone through the update loop.

These things can be done but it feels like there are a few more hoops to jump through to get them to work. We ended up using Web Components to solve this issue, which was straightforward enough but required some additional engineering on our part to get it working safely.
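
A minimal sketch of the idea behind that workaround, assuming nothing about the actual component: the focus call has to happen synchronously inside the click listener rather than after a round trip through Elm’s update loop and a port. The `Focusable` and `ClickSource` interfaces and the `focusOnClick` name are hypothetical, and are kept structural so the sketch does not depend on a DOM library.

```typescript
// Minimal structural types (assumptions for this sketch, not a real API).
interface Focusable {
  focus(): void;
}

interface ClickSource {
  addEventListener(type: "click", listener: () => void): void;
}

// Registers a click listener that focuses the target immediately.
// The focus() call runs inside the click callback itself, which is
// the condition described in the code comment quoted above.
function focusOnClick(
  trigger: ClickSource,
  findTarget: () => Focusable | null
): void {
  trigger.addEventListener("click", () => {
    const target = findTarget();
    if (target !== null) {
      target.focus();
    }
  });
}
```

In a browser, a custom element could call something like this when it is attached to the page, passing itself as the trigger and a lookup for the input to focus. Elm then only needs to render the element and never has to issue the focus command itself.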

If your app or components are going to deal with these cases often, be prepared (at the very least) to commit to more than your immediate software goals and to invest time in libraries and tooling, so that you have something you can work with effectively.

Catering for unexpected data

This was one that caught us off guard. We have specific requirements about the data that we collect due to requirements from our logistics provider. One of these requirements is that in certain countries we are required to capture the state information from the customer.

When implementing one of the express payment methods, we didn’t realise that the field could come back blank even though the docs show it as required in the response, so we neglected to check for a blank value. This caused a few cascading effects down the line for some customers, who we then had to contact to confirm their delivery address.

This is a case where Elm code does have bugs, depending on your definition of a bug. Whilst the code will not crash, the behaviour can be incorrect with respect to the business domain.
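
One way to guard against this class of problem is to validate incoming data at the boundary instead of trusting that a "required" field is populated. The following is a minimal sketch in TypeScript under stated assumptions: the `RawAddress` shape, the `parseAddress` name and the country list are all hypothetical, not the payment provider’s API or the real logistics rules. In Elm the same check would typically live in the JSON decoder or in a validation step that returns a `Result`.

```typescript
// Hypothetical shape of an address returned by an express payment method.
interface RawAddress {
  country: string;
  state: string; // documented as required, but may still arrive as ""
}

// Illustrative list only; the real set would come from the logistics provider.
const COUNTRIES_REQUIRING_STATE = new Set(["US", "AU", "CA"]);

type ParsedAddress =
  | { ok: true; country: string; state: string | null }
  | { ok: false; error: string };

function parseAddress(raw: RawAddress): ParsedAddress {
  // Treat whitespace-only values as blank.
  const state = raw.state.trim();
  if (COUNTRIES_REQUIRING_STATE.has(raw.country) && state === "") {
    return { ok: false, error: `State is required for ${raw.country}` };
  }
  // Blank is acceptable elsewhere, but is recorded explicitly as null.
  return { ok: true, country: raw.country, state: state === "" ? null : state };
}
```

Because the result is a union, callers have to handle the failure case before they can read `state`. That gives a natural place to ask the customer for the missing value during checkout rather than afterwards.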

Strong static typing does not mean you can delete all your tests

Having a strong type system will not solve all of your problems. Functional programming will not solve all of your problems. It’s really important to point this out. However, I do advocate for both functional programming and a powerful type system because they solve many fundamental problems. You’re still gonna write tests for algorithms, effects and values that you can’t enforce (or don’t want to enforce) at the type level.

It is important to understand where you need tests and ensure that they are a part of your design and development process.

Evidence? No. A lesson? Yes.

Evolving the checkout over time in Elm felt easier than working with the React checkout ever had.

Anecdotally, everyone who worked on that code was able to reason about it quickly and effectively. The real test was bringing in some other peers from the team to implement some of the new features. It turned out (after the initial language barrier) that they experienced a sense of happiness at deploying a new feature to a critical part of the site, knowing that they weren’t about to cause the checkout to crash in production.

Part of me wishes there was a way to do a study of these alternate development scenarios but I think it’s very hard to reproduce the results. There’s so much context which is required in order to understand and measure whether a particular contributing factor was statistically significant in the result.

I prefer to reflect in the context of the work environment we had created for ourselves at the time. Elm was not the sole reason the project succeeded but it made a significant contribution due to it being brought into the environment at the right time.

Would we do it again? Sure. We did. A month or two after the launch of the checkout we also ported the mini-cart on the site to Elm with similar success. We’ll continue to use Elm for all new code and gradually move the remaining Javascript components to it as we work through those parts of the application at a later date.

There was such a strong feeling within the group that Elm was a core contributor to the success of the project that I started implementing some of our backend services in Haskell to see if we can yield similar results regarding efficient reasoning and effective project evolution over both the short and long term.

Takeaways

Introducing Elm made for an increase in overall developer, team, customer and business happiness. These things are real and not often spoken about when implementing new technical projects.

Elm did a really interesting thing for us. It raised the quality bar for Javascript whilst lowering the entry bar for pure functional programming at the same time. As developers working in the application/library design space, we know how hard it can be to achieve the right balance for our intentions.

The focus from the development team, when doing retrospectives, is often about the technical details. However, the effects at the team boundaries that I’ve covered in the post are important to me because they will allow me to make arguments for better ways to write software as I continue along my software journey.

Handling errors. At runtime or compile time? There has been much debate about which is preferred. I suspect it heavily depends on the context and your use case. In this case, bringing the errors forward has meant that we were able to focus our energy and attention on other priorities. It has meant that the business is seizing upcoming opportunities instead of seeing them drift by the wayside because of resource constraints.

If you’re interested in introducing functional programming into your own work, I hope that a story like this can inspire you and your business leaders to take the leap into this paradigm and evaluate it for yourselves.

This is not just a story about Elm and functional programming coming to save the day now and forever more. I prefer to think of it as a story about achieving a solution of “good fit” for a given problem and the context in which it resides.

We all can get trapped into technologies and paradigms and forget to look around us. What I’ve described in this post may or may not work for your context. In any case, I wanted to paint enough of a picture that might allow you to see if the context warrants thinking about this at all.

As always, I hope this is useful information to you and thanks for reading!

If you like this post and want to get in touch, please reach out and follow me on Twitter.

Some special thanks are required for this post. Firstly to all of the great people who have made Elm what it is today. To the team at Bellroy for taking it on with courage and determination. To my managers and leaders for allowing us to experiment and trust us with appropriate technologies to solve real-world problems. And to my colleagues who spent their valuable time helping review this post. Thank you all!