Our systems are complex, and our world is chaotic. The past year has made this abundantly clear.
The systems we work on are often built on brittle technology, and some of them are lifelines for countless people: U.S. state unemployment websites kept crashing in March 2020 as millions who’d lost their jobs applied for benefits. In fact, Rachel Obstler, VP of product at PagerDuty, shared in an October 2020 presentation that 39 percent of developers report they’re firefighting or focusing on unplanned work 100 percent of the time.
As engineers, we must be confident that every change we push to our systems is reliable, and we must understand how the critical pieces fit together. In years past, we’d only deploy a few times a year and only had to manage a few hundred servers. Today, many organizations are deploying dozens, if not hundreds, of times a day and managing multiple servers, cloud providers, and external dependencies. Operating at scale is hard, especially when it comes to complex and distributed systems. That’s why we have to be proactive about responding to failure, to continue giving our customers a great user experience when they need it most.
Enter chaos engineering.
Despite its name, chaos engineering isn’t a chaotic practice: It’s the science of injecting precise and measured amounts of harm into a system in order to observe how it responds and build resilience. It’s not about breaking things just for the sake of it; it’s thoughtful, planned, and designed to reveal the weaknesses in our systems, safely. We can use it to prepare for the worst and weirdest scenarios, and to ensure our systems degrade gracefully under stress so users can keep using our applications when they need them. So, let’s dig into how it works.
Chaos engineering begins with asking questions—lots of them. Are we ready to tackle a high-traffic event? What happens if we temporarily lose connection to our user login database? How will a slowdown in the connection to our payment system affect the checkout flow on our website? Do we know what happens to our systems when our disk fills up?
It’s fine if your team doesn’t have answers just yet, or only has some theoretical knowledge. Asking questions is step one in chaos engineering, kick-starting a better understanding of our systems while strengthening our mental models of our applications and infrastructure. We can use the answers to these types of questions to assess our knowledge and determine our priorities for step two: a chaos engineering experiment.
Chaos engineering experiments can help us uncover circumstances and edge cases our mental models can’t foresee. They allow us to isolate problems and drill down into specific issues, which in turn allows us to build and ship more reliable features and products that meet our users’ needs while minimizing the risk of failure. We can make our systems stronger by experimenting on every element, from the database layer to the messaging queue to caching, all the way to the compute layer.
To construct a chaos engineering experiment, you’ll need a hypothesis and a defined blast radius, magnitude, and set of abort conditions. The blast radius is what you’re attacking: for example, a number of hosts, a percentage of a service, or a percentage of traffic. The magnitude is the impact or intensity of your chaos, for instance a 5 percent increase in CPU load or 200 milliseconds of added latency. Abort conditions are the circumstances in which you’d stop the experiment, such as your application being unavailable, an increase in negative end-user experiences, or a violation of your service-level objectives.
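As a minimal sketch, the four parts of an experiment can be written down as a small data structure before anything is run. The class name, field names, metric names, and thresholds below are all hypothetical, chosen only for illustration; they don’t correspond to any particular chaos engineering tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout: an illustration, not a real tool's API.
Metrics = Dict[str, float]


@dataclass
class ChaosExperiment:
    hypothesis: str    # what we expect to remain true during the attack
    blast_radius: str  # what is being attacked
    magnitude: str     # how intense the attack is
    # Each abort condition returns True when the experiment must stop.
    abort_conditions: List[Callable[[Metrics], bool]] = field(default_factory=list)

    def should_abort(self, metrics: Metrics) -> bool:
        """Return True if any abort condition fires for the observed metrics."""
        return any(condition(metrics) for condition in self.abort_conditions)


experiment = ChaosExperiment(
    hypothesis="Checkout stays usable when payment calls slow down",
    blast_radius="10 percent of checkout traffic",
    magnitude="200 ms of added latency",
    abort_conditions=[
        lambda m: m["error_rate"] > 0.05,    # assumed SLO: under 5% errors
        lambda m: m["availability"] < 0.99,  # assumed SLO: 99% availability
    ],
)

# Made-up observations, used only to exercise the abort logic.
healthy = {"error_rate": 0.01, "availability": 0.999}
degraded = {"error_rate": 0.12, "availability": 0.999}

print(experiment.should_abort(healthy))   # False
print(experiment.should_abort(degraded))  # True
```

Writing the abort conditions as plain functions keeps them easy to review with the team before the experiment starts, which is the point of defining them up front.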
Say we assume that when traffic increases, we’ll automatically have more capacity provisioned for our applications. To test this hypothesis, we can run a chaos engineering experiment (along with some load testing) to gather information that will help us confirm the assumption and make the best use of the resources our systems consume.
Or, consider that we don’t always recognize how complex systems operating at a global scale will behave under suboptimal network conditions. A chaos engineering approach to supporting customers with slower internet connections might be to inject latency into network calls, replicating a user who is farther away from the main data center where the service is operated, or to inject packet loss, replicating the network conditions of users in certain regions. When we validate the user experience under chaotic conditions, we can better optimize what we’re building to serve global users.
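To make the idea concrete, here is a rough sketch of injecting latency and packet loss around a single call. It is an application-level simplification: in practice these faults are often injected lower in the network stack. The function `with_network_chaos`, the exception `InjectedPacketLoss`, and the stand-in `fetch_profile` are hypothetical names invented for this example.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedPacketLoss(Exception):
    """Raised instead of making the call, to simulate a dropped request."""


def with_network_chaos(
    call: Callable[[], T],
    latency_seconds: float = 0.0,
    loss_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call` after an added delay, or drop it with probability `loss_rate`."""
    rng = rng or random.Random()
    # rng.random() is in [0.0, 1.0), so 0.0 never drops and 1.0 always drops.
    if rng.random() < loss_rate:
        raise InjectedPacketLoss("simulated dropped request")
    sleep(latency_seconds)  # the magnitude of the latency attack
    return call()


def fetch_profile() -> str:
    # Hypothetical stand-in for a real network call.
    return "profile-data"


start = time.monotonic()
result = with_network_chaos(fetch_profile, latency_seconds=0.2)
elapsed = time.monotonic() - start
print(result, f"took {elapsed:.2f}s")
```

Passing `sleep` and `rng` as parameters is a design choice that lets the wrapper be exercised in tests without real delays or real randomness.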
At its core, chaos engineering is about probing and anticipating failure so that we can build applications with confidence that they will withstand unexpected or chaotic circumstances, and about using those experiences to ready your team to handle incidents of any scale.
One way to really put your team through its paces is to run “fire drills,” planned events that validate your incident response processes. Start by communicating to your team that you’re running a fire drill, and decide whether you want to re-create a recent outage as an experiment or replicate the system conditions needed to exercise a runbook. Then, replicate the conditions by running experiments, and continue to iterate on them, expanding the blast radius and magnitude. The end goal, not just of fire drills but of chaos engineering as a whole, is to have automated experiments running continuously across all environments.
This process tests your team’s incident response plans to ensure they’re documented, complete, and ready to be executed when needed. It will also help ensure you have the tools and processes in place to keep the business running smoothly during an outage. As you run through the drill, make sure the links referenced in the runbook point to the proper tooling, your dashboards and log links are working, and you’ve documented everything your team needs to work through and resolve an incident, from communication protocols to the postmortem write-up template. If you find something missing, or something that doesn’t work, you have an opportunity to make changes before a real incident occurs.
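Part of that runbook check can be automated. The sketch below pulls URLs out of a runbook’s text and reports the ones that fail a reachability check. The function names and the example.com URLs are hypothetical, and the example run uses a stub checker rather than real network requests; a real drill would still need a person to confirm each link points at the right dashboard.

```python
import re
import urllib.request
from typing import Callable, List

# Simple pattern for illustration; it will not cover every valid URL form.
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def default_check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # urllib raises URLError/HTTPError (both OSError) on failure,
        # and ValueError for malformed URLs.
        return False


def find_broken_links(
    runbook_text: str,
    is_reachable: Callable[[str], bool] = default_check,
) -> List[str]:
    """List the URLs in the runbook that fail the reachability check."""
    urls = [u.rstrip(".,;") for u in URL_PATTERN.findall(runbook_text)]
    return [url for url in urls if not is_reachable(url)]


# Hypothetical runbook excerpt and a stub checker, so no network is needed.
runbook = """
Dashboard: https://dashboards.example.com/checkout
Logs: https://logs.example.com/checkout.
Postmortem template: https://wiki.example.com/postmortem-template
"""
known_good = {
    "https://dashboards.example.com/checkout",
    "https://wiki.example.com/postmortem-template",
}
broken = find_broken_links(runbook, is_reachable=lambda url: url in known_good)
print(broken)  # ['https://logs.example.com/checkout']
```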
As the team gets comfortable running fire drills, your mean time to detect and mean time to resolve metrics should start going down, and the team will gain muscle memory as they practice the skills to navigate incidents with more confidence, and therefore less stress.
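To track that trend, both metrics can be computed from a simple record of each drill or incident. This is a sketch under stated assumptions: the `Incident` fields are hypothetical, the timestamps are made up, and time to resolve is measured here from when the incident began (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Incident:
    started: datetime   # when the failure began (or was injected)
    detected: datetime  # when the team noticed it
    resolved: datetime  # when service was restored


def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    """Average gap between an incident starting and being detected."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_resolve(incidents: List[Incident]) -> timedelta:
    """Average gap between an incident starting and being resolved."""
    seconds = mean((i.resolved - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Made-up fire drill records, for illustration only.
drills = [
    Incident(
        started=datetime(2021, 1, 5, 10, 0),
        detected=datetime(2021, 1, 5, 10, 12),
        resolved=datetime(2021, 1, 5, 11, 0),
    ),
    Incident(
        started=datetime(2021, 2, 2, 10, 0),
        detected=datetime(2021, 2, 2, 10, 6),
        resolved=datetime(2021, 2, 2, 10, 40),
    ),
]

print(mean_time_to_detect(drills))   # 0:09:00
print(mean_time_to_resolve(drills))  # 0:50:00
```

Comparing these numbers drill over drill, rather than looking at any single value, is what shows whether practice is paying off.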
Reliability at scale doesn’t mean eliminating failure. It means better anticipating and mitigating how it impacts our users, and improving how we handle it. As engineers, we’re constantly iterating, seeking to understand and improve our systems’ performance. If 2020 has taught us anything, it’s that we have to be prepared for whatever this chaotic world might throw our way. Chaos engineering helps us think about and prepare for the worst-case scenarios our applications, systems, people, and infrastructure might face.
Run chaos engineering experiments on everything, uncover the inevitable failures, learn from them, and build around them. Soon, you’ll have transformed reliability from myth into reality.