Case studies in rearchitecting – Increment: Software Architecture

Several years ago, a retailer approached ThoughtWorks, a global software consultancy firm, for a redesign of its system: It wanted a centralized architecture. When ThoughtWorks CTO Rebecca Parsons and her team presented the requested redesign, a member of the client’s team asked, “Okay, but what happens when we lose Scotland?”

Though Parsons, along with Patrick Kua and ThoughtWorks’ Neal Ford, wrote a book on the subject (Building Evolutionary Architectures), she had never before encountered a client who expressed concern about “losing” a country. She learned that, at some point in this client’s past, a freak software failure had resulted in a total loss of communication between its head office and its stores in Scotland for several days. So traumatizing was this failure that it was burned into the organization’s collective memory.

Although the client had requested it, a centralized architecture would not do in this case. The retailer needed the additional safety net of being able to operate offline for multiple hours, if not days. It needed something of a hybrid: part centralized, part decentralized, with significant capacity for offline processing.

“There is no such thing as one right architecture,” Parsons says. “The right architecture depends on the industry you’re in, the specifics of the application, [and] the organization.”

Increment spoke with four organizations that have rearchitected their software in response to drastic and distinct changes: Buffer, which had the immense task of resolving two monoliths after making an acquisition; ThoughtWorks, which has developed a tried-and-true approach to rearchitecting legacy systems over decades of client work; mobile bank N26, which found itself on a hockey-stick growth curve and realized only an architectural change could help it cope; and Zapier, which overhauled its architecture in response to customer needs.

Buffer: acquiring a monolith

By 2016, six years after CEO Joel Gascoigne started Buffer as a side project, the company had cemented itself as a go-to social media publishing service for businesses. Unusually, it was also becoming known for its all-remote team structure, which the company discussed transparently in blog posts, along with its experiments in management and product.

Behind the scenes, its rapid growth stretched the company’s monolithic architecture to its limits. Buffer was built on PHP using the CodeIgniter framework, which was connected to a large MongoDB database and was running on AWS’s Elastic Beanstalk. “We had too many engineers contributing to the same codebase, and there was a lot of fragility,” says CTO Dan Farrelly. Farrelly has been at Buffer since its early days and was working as a software architect at the time.

Being a wholly remote company comes with its own unique communication challenges and overhead. “Everyone’s committing and shipping code 24 hours a day, throughout different parts of the day, to a single large system,” Farrelly says. For Buffer, every change an engineer made meant they had to share it with 15 other people in different time zones, whether by video call or in writing, and they couldn’t keep up that level of communication.

Farrelly and the team decided to break the monolith into microservices, experimenting with Docker and eventually building their first orchestration services with Kubernetes. “In a nutshell, we needed to free up different parts of our application [so they could] be worked on more independently,” he explains.

In the midst of it all, Buffer acquired Respondly, a tool that enabled social media teams to respond to customers on platforms like Twitter. But Respondly had a monolith of its own, meaning Buffer now oversaw two monoliths that were not connected in any way. It began to feel like there were “two separate companies operating within Buffer, and they didn’t communicate at all with each other,” Farrelly says. “[They were] disconnected from a team level [and] from an architectural level.”

Buffer had two options for moving forward: have one massive product and add features as customer needs evolved, or reimagine Buffer as a platform that would support multiple complementary products. In March 2017, Gascoigne laid out a vision for Buffer as a multiproduct platform. The company’s original social media publishing tool, which then was bringing in over 99 percent of its revenue, would be rebranded as Publish. Respondly would become Reply. And Buffer would build a third product, an advanced social media analytics tool called Analyze.

But for this to happen, Buffer needed a set of core functionalities that all its products would share: user login and authentication, billing, sessions, centralized storage of a user’s social media account connections—in short, a new foundational architecture.

While waiting for their product teams to build and test Analyze’s MVP and align their individual road maps, Farrelly and another senior engineer met in New York in November 2017 to plan shared services and foundational architecture. They started to break the two monoliths into specific segments. About 70 percent of the time, Farrelly estimates, these segments started to function more as a service-oriented architecture. In mid-2018, Analyze launched as an add-on to Publish; it could be consolidated as a stand-alone product only after a foundational architecture was put in place. Then, in October 2018, Buffer was able to properly resource its foundational architecture changes, and a Core team, led by Farrelly, was formed with three dedicated engineers and a designer. Buffer now had five product teams in total: Core, Publish, Reply, Analyze, and Mobile.

“We’re not dogmatic about monolith versus service. We’re not a regimented organization. We’ve never had any very strong processes around specifications or requirements,” Farrelly says. “But we are user-first, product-first.” So the Core team began to build an authentication service that would allow users to sign in once to access all three of Buffer’s products.

Farrelly says, “We had always delayed that because, from a product standpoint, it wasn’t as validated. But especially with a third product, we figured the friction to sign up for a second product needed to be zero.” The team decided that all customers would be linked in the product by their email addresses; if they had used the same email address for Publish and Reply, their accounts would be matched and combined into a central database and authentication and account-management system. The team debated allowing users to self-enroll and connect their two accounts themselves, but that user experience would not do—it had to be as seamless as possible. With the automatic merging of a user’s multiple accounts, the user would only notice the pretty new login page the team was simultaneously building to go with the new system.
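The email-based matching described above can be pictured with a minimal sketch. It is not Buffer's code: the record shapes, the `normalize` rule, and the `UnifiedAccount` type are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedAccount:
    email: str
    # Hypothetical mapping: product name -> that product's legacy user id.
    products: dict = field(default_factory=dict)


def normalize(email: str) -> str:
    # Assumption: addresses match case-insensitively, ignoring stray spaces.
    return email.strip().lower()


def merge_accounts(publish_users, reply_users):
    """Combine per-product user records into one account per email address."""
    accounts = {}
    for product, users in (("publish", publish_users), ("reply", reply_users)):
        for user in users:
            key = normalize(user["email"])
            account = accounts.setdefault(key, UnifiedAccount(email=key))
            account.products[product] = user["id"]
    return accounts


publish = [
    {"id": "p1", "email": "Ada@example.com"},
    {"id": "p2", "email": "bob@example.com"},
]
reply = [{"id": "r9", "email": "ada@example.com "}]
merged = merge_accounts(publish, reply)
```

Here the two records for Ada collapse into a single account that knows about both products, while Bob keeps a Publish-only account. The user does nothing, which is the point of merging automatically rather than asking people to link accounts themselves.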

In April 2019, the company started to migrate Reply’s users, the smaller of the two user bases, onto its new authentication and account-management system. Next came Publish’s users, migrating in six hours to a system that had been six months in the making.

Finally, in July 2019, Analyze launched as a stand-alone product on top of this shared foundation, with Buffer looking and functioning like a platform of three distinct products.

Architecture can be thought of in terms of communication patterns, API design, bounded contexts, and coupling and decoupling. But, to Farrelly, these are implementation details. Architecture is above all in the service of the company vision, its product, and its users.

ThoughtWorks: rearchitecting for incremental change

It’s one thing to start with a blank slate, making effective choices about your technology and architecture as your product organically takes shape. It’s another to realize your architecture needs modernization because something has changed—your organization, your product, your market, your technology, or just the world around you—while your software hasn’t. For several of ThoughtWorks’ clients, the challenge is to modernize legacy systems that are at the core of their businesses.

In 2004, Rebecca Parsons’s colleague Martin Fowler started to use the analogy of a strangler fig tree to describe the challenge of building a new architecture around a legacy system and using it for support until it’s no longer needed. The goal is to completely or largely replace the legacy system incrementally over time, rather than in one fell swoop. One way to replace such an old system is to first build a scaffold of new technology around it, “because that system right now owns the data,” Parsons explains. “You effectively have two-way scaffolding to go in and out. And then you can gradually start to extract business processes over into this new world.”
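One way to picture this scaffolding is a thin facade that sits in front of both systems and decides, process by process, which one handles a request. The sketch below is illustrative only; the class name, the handler dictionaries, and the process names are assumptions, not anything ThoughtWorks describes.

```python
class StranglerFacade:
    """Routes each business process to the new system once it has been migrated."""

    def __init__(self, legacy_handlers, new_handlers=None):
        self.legacy_handlers = legacy_handlers
        self.new_handlers = dict(new_handlers or {})

    def migrate(self, process, handler):
        # Move one business process over; everything else still hits legacy.
        self.new_handlers[process] = handler

    def handle(self, process, payload):
        handler = self.new_handlers.get(process) or self.legacy_handlers[process]
        return handler(payload)

    def legacy_still_needed(self):
        # True while at least one process has no replacement yet.
        return any(p not in self.new_handlers for p in self.legacy_handlers)


# Hypothetical processes, each tagged with the system that served it.
legacy = {
    "record_trade": lambda payload: ("legacy", payload),
    "risk_report": lambda payload: ("legacy", payload),
}
facade = StranglerFacade(legacy)
before = facade.handle("record_trade", {"id": 1})
facade.migrate("record_trade", lambda payload: ("new", payload))
after = facade.handle("record_trade", {"id": 1})
```

Callers only ever talk to the facade, so each process can move on its own schedule, and `legacy_still_needed()` gives a rough signal for when the old system has nothing left to do.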

One of ThoughtWorks’ clients had a legacy risk-management system that served its commodities-trading business but needed replacing. The primary function of this outmoded system was to record data, which the client could then access as needed. Parsons’s team created a simple one-directional scaffolding around this system—the ability to change the data in the repository was unnecessary—and over time moved more and more functionality to it.

Parsons’s team was confident they had moved everything over into the new system. The client, however, wasn’t convinced, and didn’t want to shut the legacy system down. So ThoughtWorks decided to let the old system run until it crashed, as machines are wont to do. When it did, nobody noticed. After this successful—and only somewhat accidental—test, they sunsetted the old system.

But legacy transformations don’t always have to end in the legacy system being shuttered, Parsons says. It’s all about making a conscious choice. The goal with a successful strangler fig–style rearchitecture is to first understand what’s truly needed from the old system, rather than just trying to mimic it. “From a process perspective, because the systems have been around for so long, and usually have a significant amount of technical debt, people have changed their business processes to work around the deficiencies of the core system,” she explains. “You don’t want to replicate that process. You want to think, ‘How would I like to achieve this business objective, and what kind of data and technology and processing do I need to do that?’”

A ThoughtWorks client several years ago operated a trading system. “When most people hear ‘trading system,’ they think high throughput, low latency, performance is king. [With] this particular trading system, they figured they might do 200 transactions a day [at most]. But each one of those transactions was worth billions of dollars,” Parsons says. The key architectural drivers for this system all had to do with having a solid communication infrastructure. The client could not afford to lose track of a single message entering the system.

Contrast this with another ThoughtWorks client, which offered a sandwich-ordering system as a perk for its employees. While the system did not store any credit card information or other sensitive data, it had been built with a level of security and reliability more appropriate to medical record or financial systems. “The things that matter vary dramatically based on the kind of system you’re building. It costs a lot of money, for example, to have five or six 9s of reliability.” A six 9s (or 99.9999 percent) level of reliability means a system is down for no more than 32 seconds a year. “Not everybody needs that level of reliability, and if you don’t need it, you shouldn’t pay for it,” Parsons says.
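The arithmetic behind that figure is easy to check. The short sketch below assumes a 365-day year and treats "n nines" as an availability of 1 minus 10 to the power of minus n; the function name is made up for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def max_downtime_seconds(nines: int) -> float:
    """Yearly downtime allowed by an availability with `nines` nines (3 -> 99.9%)."""
    unavailability = 10 ** -nines
    return SECONDS_PER_YEAR * unavailability


# Downtime budgets for three through six 9s, in seconds per year.
budgets = {n: max_downtime_seconds(n) for n in (3, 4, 5, 6)}
```

Under these assumptions, six 9s leaves about 31.5 seconds of downtime a year, which rounds to the 32 seconds cited above, while five 9s allows roughly 315 seconds, or a little over five minutes. Each extra 9 shrinks the budget tenfold, which is why each one costs so much more to reach.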

What constitutes good code stays the same whether that code is for a trading business or a sandwich-ordering system, but what constitutes good architecture varies widely. It’s all in the choice and balance of different “-ilities,” as Parsons calls them: reliability, maintainability, scalability, flexibility, and so on. “You have to decide which are important to you and then use them to drive your architectural decision-making.”

Parsons’s philosophy on architecture is that it ought to be able to evolve. An evolutionary architecture, as Parsons and her coauthors describe it in their book, is one that supports incremental, guided change across multiple dimensions. It’s a moldable and deliberate structure that can support both new and legacy systems in an ever-changing technology landscape.

N26: managing microservices

To Patrick Kua, who spent 14 years consulting at ThoughtWorks and coauthored Building Evolutionary Architectures with Parsons and Ford, the mobile bank N26 was a product he could at last take ownership of. Like a traditional bank, N26 has to meet both user expectations and regulatory guidelines. Fully licensed in Europe, the bank launched first in Germany, in 2013, and now serves over 3 million customers throughout the continent. It became available in the U.S. in 2019, where it operates in partnership with Axos Bank. Kua joined N26 as CTO in 2017, when the company was four years old, and until recently was its chief scientist.

Around February 2018, during a period of hypergrowth, the company noticed that push notifications were, in some cases, becoming noticeably asynchronous. A bank like N26 uses push notifications a lot: for two-factor authentication, for alerting a user that a transaction has been made on their bank account, and so on. The problem fell to N26’s foundation-product team, which handles services that are triggered regardless of the business domain.

“Like every company, we have some legacy,” says Kua, including “a little bit of a monolith.” But N26 also had strong monitoring practices. The foundation-product team was able to build a model of the issue and say, with confidence, that N26 had three months at its current rate of growth before the problem became unworkable. The older functionality was part of this remnant monolith, and the existing capability would need to be migrated to a new service.

Instead of “writing to a database table to trigger a notification,” the team created a new service that would “send an event to a particular service to say, ‘Actually, please do this,’” Kua explains. This architectural change also allowed the team to tighten up security around push notifications, something they had been wanting to do for a while.
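The shift Kua describes, from writing rows into a shared table to sending an event to a service that owns notifications, can be sketched roughly as follows. This is an in-process toy rather than N26's design: the `EventBus`, the `PushRequested` event, and the `NotificationService` are hypothetical names, and a real system would put a message broker and a push provider where the comments indicate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PushRequested:
    user_id: str
    message: str


class EventBus:
    """In-process stand-in for whatever broker carries events between services."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event):
        for handler in self._subscribers[type(event)]:
            handler(event)


class NotificationService:
    """Owns push delivery; callers no longer touch its storage directly."""

    def __init__(self):
        self.sent = []

    def on_push_requested(self, event):
        # A real service would call a push provider here.
        self.sent.append((event.user_id, event.message))


bus = EventBus()
notifications = NotificationService()
bus.subscribe(PushRequested, notifications.on_push_requested)
bus.publish(PushRequested(user_id="u42", message="Card payment: 12.50 EUR"))
```

Because producers only publish an event, the notification service becomes the single place to add checks before anything is sent, which is consistent with the security tightening the team wanted.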

“We do testing internally, so we’re on a special whitelist so that we can eat our own dog food,” Kua says. When they were confident in their solution, they moved the bank’s entire user base to this new service, first in increments of 1 percent, then 5, then 10. The migration took a month to complete and required a great deal of coordination between teams that relied on the service.
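A staged rollout of this kind is often implemented by hashing each user into a stable bucket and comparing the bucket to the current rollout percentage. The sketch below shows that general technique; the article does not say how N26 selected users, and the function names and the `allowlist` parameter are assumptions.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0 to 99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def uses_new_service(user_id: str, rollout_percent: int, allowlist=frozenset()) -> bool:
    # Internal testers on the allowlist always get the new path.
    if user_id in allowlist:
        return True
    return bucket(user_id) < rollout_percent


users = [f"user-{i}" for i in range(10_000)]
# Fraction of the sample routed to the new service at a 10 percent rollout.
share_at_10 = sum(uses_new_service(u, 10) for u in users) / len(users)
```

Because the bucket depends only on the user ID, anyone moved at 1 percent stays on the new service at 5 and 10 percent, so users do not flip back and forth as the rollout widens.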

Today, N26 has about a hundred microservices, and as its backend has evolved, different teams have come to write and own different services. To know what design changes are architecturally significant, Kua looks out for “system smells,” the system-level counterpart of code smells. These might arise if “you’re changing the interfaces between objects, which is the same thing across boundaries of teams,” like the type of change the push notifications required. Another indicator is when “things fall between or across teams,” Kua says. “That’s where we actually think about architecture across the whole system.”

One such case arose in October 2018. Early on, there wasn’t an agreed-upon format for how services threw errors to the Android and iOS system frontend. “[One] service tended to throw errors this way; [another] service threw errors that way,” Kua says. The frontend was being asked to understand multiple ways of handling errors. But “it takes more time to undo this because you have to coordinate across teams.” To address this, the company formed a working group to engage with people involved in the change and rewrite error handling across services. The process has proved so incremental and large-scale that it’s likely to take N26 another year to implement.
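To make the problem concrete, here is a rough sketch of what converging on one error format can look like: each service's existing error shape is translated into a single envelope the frontend can rely on. Both legacy shapes and the `ApiError` fields are invented for the example and are not N26's actual formats.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ApiError:
    code: str     # stable, machine-readable identifier
    message: str  # human-readable text the app can display
    status: int   # HTTP status the service responded with


def from_service_a(raw: dict) -> ApiError:
    # Hypothetical legacy shape: {"error": "...", "detail": "...", "http": 404}
    return ApiError(code=raw["error"], message=raw.get("detail", ""), status=raw["http"])


def from_service_b(raw: dict) -> ApiError:
    # Hypothetical legacy shape: {"errors": [{"id": "...", "title": "..."}], "status": 422}
    first = raw["errors"][0]
    return ApiError(code=first["id"], message=first["title"], status=raw["status"])


a = from_service_a({"error": "card_not_found", "detail": "No such card", "http": 404})
b = from_service_b(
    {"errors": [{"id": "invalid_iban", "title": "IBAN is invalid"}], "status": 422}
)
envelopes = [asdict(error) for error in (a, b)]
```

The translation code is the easy part. The slow part, as Kua notes, is getting every team to agree on the envelope and to retire its own format.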

“The natural consequence of microservices architecture is that you can’t really predict how the architecture is evolving,” and no one person can be its guardian, Kua says. What an organization can do, he adds, is align engineering teams in a way that reflects the architecture, and enforce practices to constantly “monitor for unwanted behaviors [within or across teams] and pull things out.” The goal is to move nimbly but carefully. “For us, it means thinking about architecture as a living ecosystem.”

Zapier: changing the product, changing the core

Zapier takes a simple idea—if there is an event in this app, trigger an event in that app—and executes it extremely well. Conceived during a startup weekend in 2011, Zapier launched with a set of 20 apps (including PayPal and Dropbox) that weren’t designed to talk to each other but could do so when connected through Zapier’s integrations, called Zaps. Since then, Zapier has expanded its coverage to over 1,500 apps and has become the go-to automation tool for 3 million users. Most of these are small businesses automating repetitive tasks, like the entering of data from an order form into a spreadsheet, but Zaps can get creative and complex fast. In its early days, “the key architecture was not so much a technical architecture, it was more of a go-to-market product architecture,” says Zapier CTO Bryan Helmig.

That meant talking to users, being selective about which 20 apps the service would support when it launched, and building integrations fast, without being overly prescriptive about how developers went about it.

In 2014, the team started to feel that perhaps they had kept their architecture too simple. Through their qualitative research, they uncovered a desire from customers to do more—to have one trigger cause a set, or workflow, of actions. This came up in their quantitative research, too: Users would create a dozen Zaps with the same trigger, duplicative work that spoke to a greater need.

The idea was to address this need with a core product shift. Instead of a one-to-one automation, there would be a linear chain automation, one-to-one-to-one: multistep Zaps that had to be as easy to use as they were powerful. This meant Zapier would have to make its first major architectural shift. The team had to implement a directed rooted tree that could support an arbitrary number of steps in a Zap, while still maintaining support for the hundreds of thousands of one-to-one Zaps that were currently operational. Each step, the team quickly realized, would need to be entirely unaware and independent of the others. On top of this system would be an omniscient new workflow engine, which would take the Zap through each step to completion and house things like error handling, reporting, and more.
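The division of labor described here, with independent steps and an engine that knows about all of them, might be sketched as below. This is an assumption-heavy illustration, not Zapier's engine: `Step`, `RunReport`, and `WorkflowEngine` are hypothetical names, and the steps are plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]  # takes the previous output, returns its own


@dataclass
class RunReport:
    completed: List[str] = field(default_factory=list)
    error: str = ""


class WorkflowEngine:
    """Drives a chain of steps to completion; steps never call each other."""

    def execute(self, steps, trigger_data):
        report, data = RunReport(), trigger_data
        for step in steps:
            try:
                data = step.run(data)
            except Exception as exc:  # error handling lives in the engine
                report.error = f"{step.name}: {exc}"
                break
            report.completed.append(step.name)
        return data, report


steps = [
    Step("parse_order", lambda d: {**d, "total": d["qty"] * d["price"]}),
    Step("add_row", lambda d: {**d, "row": [d["item"], d["total"]]}),
]
engine = WorkflowEngine()
result, report = engine.execute(steps, {"item": "tea", "qty": 2, "price": 4})
# A one-to-one automation is simply a chain with a single step.
single, _ = engine.execute(steps[:1], {"item": "tea", "qty": 1, "price": 4})
```

Since each step sees only the data handed to it, a chain of one step and a chain of ten run through the same engine, which is one way the older one-to-one automations could keep working unchanged.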

This is risky, resource-intensive work. “How can you align [larger architectural investments] with features and product changes that customers are really excited to have?” Helmig asks. Since this wasn’t a standard architectural shift but a core product change, it would need an entirely new editor on which users could design their own multistep Zaps. This meant work on the frontend started six months after the backend.

Once the new editor was designed with a downward linear flow and the backend was engineered, the team moved a few Zaps over to the new system and let them run side by side, expecting that there would be no difference between them because they were functionally equivalent. “Over months, we dogfooded this ourselves internally,” Helmig says. Then they began moving some users over to the new system, until they were confident it was reliable. By February 2016, they had migrated all users over and made multistep Zaps available publicly.
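Running two systems side by side and checking that they agree can be reduced to a small comparison harness. The sketch below shows the general idea; the `run_side_by_side` helper and the two toy automations are hypothetical stand-ins, not Zapier's tooling.

```python
def run_side_by_side(old_system, new_system, inputs):
    """Feed the same inputs to both systems and collect any disagreements."""
    mismatches = []
    for item in inputs:
        expected, actual = old_system(item), new_system(item)
        if expected != actual:
            mismatches.append({"input": item, "old": expected, "new": actual})
    return mismatches


def old_zap(event):
    # Stand-in for the existing one-to-one automation.
    return {"sheet_row": [event["name"], event["amount"]]}


def new_zap(event):
    # Stand-in for the same automation running on the new engine.
    row = [event["name"], event["amount"]]
    return {"sheet_row": row}


events = [{"name": "Ada", "amount": 10}, {"name": "Bob", "amount": 25}]
mismatches = run_side_by_side(old_zap, new_zap, events)
```

An empty list of mismatches over a long enough period is the evidence that the two implementations really are functionally equivalent. Any entry points to a specific input worth investigating before more users are moved.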

Unusually, Helmig says, the team decided to tactically overengineer their backend by 20 to 30 percent. With such a powerful workflow engine, they wanted to eventually have Zaps running in parallel. (If this condition is met, trigger this path; if not, trigger that path.) In December 2018, they matched a frontend to this feature and released it as Paths.
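Conditional branching of this sort falls out naturally once steps are stored as a tree rather than a list. The following sketch assumes each branch is guarded by a predicate; the `Node` type, `add_path`, and `run_tree` are illustrative names and do not reflect how Paths is actually built.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    action: Callable[[dict], dict]
    # Each child is stored as a (condition, node) pair.
    children: List[tuple] = field(default_factory=list)

    def add_path(self, condition, child):
        self.children.append((condition, child))
        return child


def run_tree(node, data, trace=None):
    """Run a node, then follow every child whose condition matches."""
    trace = [] if trace is None else trace
    data = node.action(data)
    trace.append(node.name)
    for condition, child in node.children:
        if condition(data):
            run_tree(child, data, trace)
    return trace


root = Node("new_order", lambda d: d)
root.add_path(lambda d: d["total"] >= 100, Node("notify_sales", lambda d: d))
root.add_path(lambda d: d["total"] < 100, Node("send_thanks", lambda d: d))
big = run_tree(root, {"total": 250})
small = run_tree(root, {"total": 20})
```

A linear multistep automation is the special case where every node has exactly one child whose condition is always true, so a backend built around a tree can serve chains first and branches later.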

Today, this architecture allows Zapier to build features like Paths for its power users; in the future, the architecture will also have to support Zapier when it experiments with ways to introduce later adopters to automation and build for them.

Good architecture is necessary architecture

Whether because of changes in user, industry, or technology needs, software architecture will continue to evolve. Architecture cannot be stagnant, as each of the companies profiled here can attest. Rather, it is a living, breathing set of entities.

“Good architecture is really the necessary parts of architecture,” Helmig says. “Those are things like systems that are decoupled [and] teams that are decoupled so you can horizontally scale out your product organization. Good architecture is science.”

Great software architecture, on the other hand, is art. For everything that might change in a product, there are certain user needs that will stay constant. Amazon’s users will always want fast shipping, low prices, and lots of options; Zapier’s users will always want high reliability, quality integrations, and ease of use. When architectural decisions keep these unchanging user needs at their core, Helmig says, companies maximize product options for the future. Great architecture lays a solid foundation for the product that is needed now, but is malleable enough to support the product that might be needed tomorrow.
