From source control through continuous build, integration, testing, and deployment, a software delivery system is the collection of systems and practices that take code from idea to production. A good one will enhance the experience of working at your company. A bad one—or worse, none at all—will be a constant source of toil and frustration.
There’s a temptation to jump straight to discussing how software will solve these problems—and yes, software will be part of your solution—but you won’t succeed with software alone. Without an atmosphere of trust and psychological safety, your people won’t be able to do their best work.
In this article, we’ll examine the capabilities you’ll need from your stack in order to move fast without breaking things. But, more importantly, we’ll underscore that your engineers need to trust each other, and their tools, in order to ship code quickly and reliably.
Fast, frequent, safe deployments are the hallmark of great software operations. Why? Because changes shipped this way are easier to reason about, go wrong less often, and grant engineering teams energy and momentum. Although adoption varies widely, this best practice isn’t merely our own opinion or preference. The big players (FAANG et al.) have fully adopted it; some organizations are just starting out; and many more are somewhere in between.
In their 2018 book Accelerate and 2019 “State of DevOps” report, Dr. Nicole Forsgren, Jez Humble, and Gene Kim interrogated the habits of engineering organizations. They found that what separated mediocre organizations from good or great ones was the frequency and reliability of their software delivery systems—their ability to push code. They also found that excellence in this domain bred further excellence, empowering the highest performers to keep getting better.
Forsgren, Humble, and Kim settled on five measures of software delivery and operational performance that correlated with high-functioning software organizations: deployment frequency, lead time for changes, time to restore service, change failure rate, and availability. Of these five measures, the first four explicitly answer the question, “Do you have a high-functioning software delivery system?” Availability, the fifth measure, is correlated with the other four—do those right, the authors wrote, and you’re likely on the right track for them all.
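As a rough illustration of how the first four measures might be tracked, here is a minimal Python sketch. The `Deploy` record and its field names are hypothetical, not taken from Accelerate or from any particular tool, and using medians is an assumption; a real pipeline would pull these timestamps from its own deploy and incident logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it degrade service?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def delivery_metrics(deploys: list[Deploy], window_days: int) -> dict:
    """Summarize the four delivery measures over a window of `window_days` days."""
    if not deploys:
        raise ValueError("need at least one deploy")
    failures = [d for d in deploys if d.failed]
    # Only failures that have actually been resolved contribute a restore time.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here so that a single unusually slow deploy or long outage doesn’t dominate the summary; an organization might reasonably prefer percentiles instead.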
Doing the work we recommend in this article won’t get you all the way to reliability, but it will make the road smoother. Every step you take to improve reproducibility, automate the fiddly bits, and delegate authority is a step toward improving your velocity. Every step can reduce developer burnout, increase development teams’ momentum, and free up engineering capacity.
Let’s start by unpacking what a “high-functioning” software delivery system looks like.
It’s reliable. When you reach for it, it’s always there. Using the software increases confidence in the software.
It’s repeatable. It’s the same every time—monotonous and implacable in its rhythm. This makes it easier to reason about and debug.
It’s reversible. Whatever you do, you can also undo and redo. (Build an “undo” button so it’s easy to get out of any hole you put yourself in—you won’t regret it.)
It’s fast. Speed breeds confidence. When a developer’s code ships in the time it takes to get a coffee and check Reddit, they still have the change’s context in mind when it hits production. This puts developers in a strong position to aid in troubleshooting and encourages smaller changes, which are easier to debug.
It’s hermetic. Everything required to recreate your code is available without recourse to third parties or dependencies outside the system. Combined with repeatability, this means code sent through the system on Monday should generate the same results every day thereafter.
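One way to check the repeatable and hermetic properties is to build the same commit twice and compare the outputs byte for byte. The sketch below is an assumption about how such a check could look, not a description of any specific build tool: `tree_digest` is a hypothetical helper that fingerprints a directory of build artifacts.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Return a SHA-256 digest over every file's relative path and contents.

    Files are visited in sorted order so the digest does not depend on
    filesystem listing order.
    """
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between the path and the contents
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

If two builds of the same commit, run on different days or machines, yield different digests, something outside the system (a timestamp, an unpinned dependency, a network fetch) is leaking into the build.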
If this is already your baseline, no property on this list should break your stride. If you’re already working on it, ask yourself: Do you know what code you’re running right now? Do you know who modified it last and what they did? Did someone ensure it was going to have the desired effect? Did you write down what that desired effect was? Your answers to these questions can be a jumping-off point for optimizing your system.
If this isn’t your baseline, you may be tempted (again) to leap headfirst into software solutions. But while many technologies can help us build software delivery systems, we must start by building a foundation of trust. Why? To move fast, engineers need to be empowered to make changes and innovate. To do so quickly and with confidence requires a high-trust environment.
Imagine a company aiming to ship code 10 times a day—a somewhat large but achievable number. To do so requires a manager’s delegation of trust to an engineer and their peer approver. The engineer can’t run every change by their manager before proceeding; that would defeat the purpose of moving with speed and efficiency. True ownership of software is limited—and exhausting—if an engineer doesn’t feel empowered to improve it. Every pain point must be within their remit to eliminate. Ultimately, you can’t safely be on call for a service you can’t fix. Trust is an enabling technology.
To solve your biggest problems, you’ll need to listen to your users, because only they can tell you what those problems are. Aim to keep most of your users happy most of the time. Different teams will have different challenges: some you can help on your platform, and others you should help move off your platform to find support elsewhere.
Understand the limits of your service. The team that owns the delivery pipeline builds and tests less software than the engineering org combined. Your users know better than you what your system can do. In computer security, “arbitrary code execution” is a scary breach; in delivery systems, it’s the service you’re offering.
Give users guidelines for how features should work. How many deploys should a service be able to run in parallel? How many tests? Your users will find the edges of your capacity, so learn to love them for it—they’re going to do it whether you want them to or not.
Retire redundant services. Last year, our delivery team at Squarespace owned seven services that conceptually did the same thing. We did our research, made sure we understood the use cases properly, then replaced those seven services with one, eliminating untold toil. Running one service instead of seven put us in a position to make that service great. It was built to meet the needs of the business now and in the future.
Even with trust and a good delivery system in place, we will make mistakes. And with modern infrastructure, we’ll make them with unprecedented speed and breadth. That makes time to recover absolutely crucial: It’s okay to make mistakes if their cost is small.
Imagine a team of 200 engineers, each making five changes a week, with a 1 percent mistake rate. For the purpose of this example, let’s say a mistake is one bad enough to require developers to roll back the code. That’s 10 mistakes a week, 520 a year. If we give you back two weeks for the holidays, that’s 500 mistakes making it to production every year. If you have an uptime SLO of 99.95 percent, your error budget is about 263 minutes a year. If every mistake takes the service down until it’s rolled back, you’ll need to recover from each one in roughly 30 seconds. Can you recover from a mistake in 30 seconds? Can you recover from every mistake in 30 seconds?
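The arithmetic in this example can be reproduced in a few lines of Python. This is a sketch under simplifying assumptions that are ours for illustration: a 365-day year, every mistake causing a full outage until it is rolled back, and nothing else consuming the error budget. The function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def error_budget_minutes(slo: float) -> float:
    """Minutes of downtime per year allowed by an uptime SLO such as 0.9995."""
    return (1.0 - slo) * MINUTES_PER_YEAR


def recovery_budget_seconds(
    engineers: int,
    changes_per_week: int,
    mistake_rate: float,
    working_weeks: int,
    slo: float,
) -> float:
    """Seconds available to recover from each mistake before the budget is gone."""
    mistakes_per_year = engineers * changes_per_week * mistake_rate * working_weeks
    return error_budget_minutes(slo) * 60 / mistakes_per_year


budget = error_budget_minutes(0.9995)
per_mistake = recovery_budget_seconds(200, 5, 0.01, 50, 0.9995)
print(f"{budget:.1f} minutes per year, {per_mistake:.1f} seconds per mistake")
```

Computed this way, the yearly budget comes to roughly 263 minutes and the per-mistake allowance to roughly half a minute. Halving the mistake rate or the share of mistakes that cause an outage doubles that allowance, which is why limiting blast radius matters as much as raw recovery speed.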
The lesson here is twofold: Make recovery, particularly rollback, blisteringly fast, and limit blast radius. Think about your software ecology and design to allow for mistakes.
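To make the rollback half of that lesson concrete, here is a minimal sketch of the “undo button” idea: if the delivery system remembers every version that has been live, rolling back becomes a pointer move instead of a rebuild. The `ReleaseHistory` class is hypothetical and deliberately ignores the hard parts (database migrations, traffic shifting, limiting blast radius).

```python
from typing import Optional


class ReleaseHistory:
    """Tracks which version is live so undo and redo are cheap pointer moves."""

    def __init__(self) -> None:
        self._versions: list[str] = []  # versions that have been live, oldest first
        self._live = -1                 # index of the currently live version

    @property
    def live(self) -> Optional[str]:
        return self._versions[self._live] if self._live >= 0 else None

    def deploy(self, version: str) -> None:
        # A new deploy discards any rolled-back versions ahead of the pointer.
        del self._versions[self._live + 1:]
        self._versions.append(version)
        self._live += 1

    def rollback(self) -> str:
        """Undo: make the previous version live again."""
        if self._live <= 0:
            raise RuntimeError("no earlier release to roll back to")
        self._live -= 1
        return self._versions[self._live]

    def redo(self) -> str:
        """Redo: return to the version that was just rolled back."""
        if self._live + 1 >= len(self._versions):
            raise RuntimeError("no later release to redo")
        self._live += 1
        return self._versions[self._live]
```

The design choice worth noting is that rollback never builds anything: it only works this quickly if already-built artifacts for earlier versions are kept around and ready to serve.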
What do all these practices look like when they’re working? Empowered, confident engineers making small, frequent changes. The emotional uplift and dopamine hit of getting something done with minimal drag. Small changes that are in users’ hands quickly, easy to reason about, and safe to reverse if needed. A transparent delivery system, set up so any engineer can find out how often each piece of software went to production, how long it takes to get a change out the door, and how fickle the pipeline is today.
When you commit to it, change will come. A 30-minute deploy 10 times a day will go from impossible to a slow day. High-functioning engineers with high-functioning software can do amazing things, and do them relentlessly.
Back in 2018, we were tasked with improving developer efficiency at Squarespace. For ideas, we looked to industry, our users, and the engineers who ran our infrastructure—and we came to the conclusion that few levers are more effective than making your deployments fast, safe, and reversible. When we started this effort, we were already committed to continuous delivery as an organization, but continued incremental improvement has allowed us to give our developers the tools to move faster.
Building infrastructure is uncertain work, but there are things you can build that are always useful. Wherever product and marketing teams want us to go, they’ll want to move quickly and with confidence. So if you’re unsure what to build, build a great software delivery system.