When a software application or system breaks, changes in important metrics trigger an automated alert that something is going terribly wrong. The alert is routed to an engineer who is on-call for the application or system–an engineer responsible for making sure that any incident or outage is quickly triaged, mitigated, and resolved–and delivered via a page from a paging app like PagerDuty. What happens after the engineer is paged is where things get interesting, and complicated: incident response is a tense but measured, carefully choreographed race to triage complex software outages in a matter of minutes. In the age of multi-billion-dollar software companies, the difference between mitigating an outage in five minutes and in fifty can correspond to millions of dollars in lost revenue. When every second of downtime comes with a cost, a good incident response process is key.
To uncover the state of incident response in the tech industry, Increment asked over thirty industry-leading companies (including Amazon, Dropbox, Facebook, Google, and Netflix) about their incident response processes. What came out of this survey was striking. First, we discovered that while the companies we surveyed each have distinct, unique processes for their daily engineering operations (such as deployment, development, and monitoring), they all follow similar (if not completely identical) incident response processes. In addition, even though we live in the era of software automation, and many companies boast about the successful automation of their engineering workloads, we found that incident response processes across the industry are almost completely manual and require extensive human intervention. Incident response appears to be one of the few areas in software engineering where the industry has converged on true best practices, but these practices are very far from having automated away the manual work.
The basic incident response process that every industry leader follows contains five distinct steps that happen when an alert is triggered and a pager goes off. First, the engineer who answers the page triages the incident to determine the basic facts about the incident at hand. Second, the engineer coordinates with others at the company about the incident. Third, the engineer (along with anyone else who was pulled in during one of the earlier two steps) works to mitigate the impact the incident is having on internal or external customers, without root-causing. Fourth, the engineer(s) involved work to resolve the incident after it has been mitigated, try to determine the root cause(s), and then develop, test, and deploy a solution. The fifth and last step is the process of following up on the incident–a process that contains a postmortem document, review of the postmortem, and a series of follow-up tasks that come out of the resolution and postmortem. Incidents are not considered to be “over” until every one of these steps has been completed.
Let’s dive into each of these steps, and see how various industry leaders implement them.
Somewhere in the system, something is broken. Metrics–which could be a number of errors, certain types of errors, response times, latency, or something else along those lines–hit a critical or warning threshold, triggering an alert. The alert is routed to a paging application, which forwards it to an engineer who is on-call for the system. A push notification or text message appears on the engineer’s phone, letting them know that somewhere in the system, something important is broken.
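The alerting path described here can be sketched in a few lines. This is a hypothetical illustration, not any real alerting system's API; the function names, thresholds, and rotation format are all invented:

```python
# Hypothetical sketch of the alerting path: a metric crosses a
# threshold, and the alert is routed to whoever is on-call.

WARNING, CRITICAL = "warning", "critical"

def evaluate(value, warn_at, crit_at):
    """Return an alert level if the metric breaches a threshold, else None."""
    if value >= crit_at:
        return CRITICAL
    if value >= warn_at:
        return WARNING
    return None

def page_on_call(rotation, level, metric_name):
    """Route a triggered alert to the engineer currently on-call."""
    engineer = rotation[0]  # front of the rotation holds the pager now
    return f"PAGE {engineer}: {metric_name} is {level}"

# An error rate of 7% breaches the 5% critical threshold and pages out.
level = evaluate(0.07, warn_at=0.01, crit_at=0.05)
if level:
    print(page_on_call(["alice", "bob"], level, "api_error_rate"))
```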
The engineer receiving the page is usually a developer who is on the team responsible for building, running, and maintaining the system. They’re on-call for the system, part of a rotation with the rest of their team, and they tend to share each on-call shift with another engineer on their team. They know the system better than anyone else–after all, they probably wrote the code that’s breaking.
As soon as the pager goes off, the on-call engineer has one job: to triage the incident and to do so very, very quickly. The engineer acknowledges the alert, letting the alerting system and everyone else on the rotation and escalation policy know that they’ve started working on it. The clock starts the second they acknowledge the alert, and doesn’t stop until the exact moment the incident is mitigated, the exact moment the incident is no longer affecting internal and/or external customers.
The engineer has a few minutes to triage the incident. They need to figure out what the problem is (very loosely, and typically they can only pinpoint what the problem might be), what the impact is, how severe the incident is, and who can fix the problem. Importantly, the engineer who is doing the triaging is not responsible for fixing the problem and most likely will not be able to fix it. Their job is only to triage, only to figure out enough to either know if they can fix it themselves or who else can fix it.
The companies we surveyed all approach the triage phase in similar ways. Most of them use the same paging application (PagerDuty), and the alerts that trigger the pages are usually part of a custom in-house alerting system that the companies have built. Most of them have runbooks that engineers can follow step-by-step to triage and mitigate incidents. Most of them have spent many engineering hours working to make sure that only actionable alerts are routed to the pager. Most of the engineers on-call follow very ordinary on-call rotations and shifts, partnering with another engineer on each shift (the primary/secondary model) and sharing the overall rotation with a larger team. Most of the engineers who answer the pages are the developers who wrote the code–the only exceptions are at Slack (which has a separate operations team on-call) and at Google, where the most stable systems (Gmail, Ads, and Search) are staffed by site reliability engineering teams who did not write the systems but are responsible for maintaining them.
All of the companies have dashboards displaying the systems’ behavior that engineers can use to quickly determine the state of the incident, and they all also have very specific diagnostic guidelines to make it easy to determine the impact and severity of incidents. All of the companies require the on-call engineer who answers the page to determine the severity and impact of the incident as part of the triage phase. Diagnosing impact and severity is more difficult at some companies than at others. At Dropbox, for example, determining incident severity is rather complex, so they’ve built an internal tool that guides engineers through severity classification step-by-step when they encounter an incident. Slack, on the other hand, uses the number of users who are unable to connect to Slack as their way to classify incidents: large numbers of users who cannot complete API requests or cannot connect and upgrade to WebSockets count as high severity, and low numbers of users who cannot connect classify as low severity.
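Slack's user-count-driven classification could be sketched roughly like this. The thresholds and severity labels here are invented for illustration; Slack's real criteria (failed API requests, WebSocket upgrades) are certainly more nuanced:

```python
# A rough sketch of severity classification driven by the share of
# users who cannot connect. Thresholds and labels are made up.

def classify_severity(users_disconnected, total_users):
    """Map the fraction of users who cannot connect to a severity level."""
    if total_users == 0:
        return "sev-3"
    fraction = users_disconnected / total_users
    if fraction >= 0.25:
        return "sev-1"  # high severity: a large share cannot connect
    if fraction >= 0.05:
        return "sev-2"
    return "sev-3"      # low severity: only a small number affected
```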
Incident classification is extremely important at these companies, because what happens after the pager goes off and after the triage phase is almost always determined by the severity of the incident: who to contact, what to do, how to fix, and how to follow up. As Sweta Ackerman, Engineering Manager at PagerDuty, points out: “It is key that the [severity] levels are predefined, so when an incident occurs, the on-call engineer only has to say ‘does it match this criteria?’, and if so, let’s trigger the appropriate response.” Some companies have predefined incident response procedures only for high-severity alerts, leaving low-severity alert response to the discretion of teams. At Amazon, for example, high-impact, high-severity events must follow the standardized incident response process, while individual teams are responsible for determining their own response processes for low-impact, low-severity events.
The alert has been triaged, and the engineer(s) on-call now know the basic facts about the incident: they know (very loosely) what the problem is, they know the impact of the problem, they know how severe the problem is, and they know which teams or individuals can fix it. The next step in the incident response process is to coordinate. During the coordination phase, several things need to happen. The incident must be routed to the teams or individuals who can mitigate and resolve it, or the on-call engineers need to begin working towards mitigation and resolution if they are able to mitigate and resolve it on their own. If the former is the case, then the engineer on-call needs to communicate with the team or individuals who can fix the problem and let them know everything that is known about the incident so that they’ll be able to begin mitigation.
Once the incident is in the hands of the team who can mitigate it and push it toward resolution, the team needs to begin extensive documentation and tracking of the details of mitigation and resolution. If the incident and the response to it are not documented and tracked while the incident is underway, then doing any follow-up after incident resolution can be very difficult. Keeping all of this documentation and tracking in one or more centralized locations is important, too: it ensures that everyone across the organization who may be affected by the incident is kept up-to-speed on its status.
Any further communication that is needed will be determined by the impact and severity of the incident. High-severity, high-impact incidents that affect external users are assigned a special “incident commander”—an engineer or other technical individual on one of the on-call rotations who is deemed responsible for all communication and coordination around the incident. On the other end of the spectrum, low-severity, low-impact incidents that only affect internal users and have local scope require tracking and documentation but do not require extensive communication and organization.
Although all of the companies we surveyed have this coordination step as the second step in their incident response processes, the technologies, tools, and methods that they use to coordinate vary slightly.
During outages, coordination and debugging are usually done over chat rooms and/or video or phone conferences. Almost every company we spoke with uses Slack for chat coordination, and engineers are required to update everyone else via various dedicated Slack channels whenever there is an outage. Slack and PagerDuty, in other words, appear to be two single points of failure for the entire tech industry. Slack uses its own product for all coordination during incidents and outages, so what happens when it goes down? “To the best of our ability, we use Slack,” says Richard Crowley, Director of Operations at Slack, “[but] If we’re really having a bad day, we use Skype.”
As for video and conference calls, Datadog runs Google Hangouts during outages, GitLab live-streams their debugging on YouTube, and Amazon runs phone conference calls during incidents that are led by specialized “call leaders” who are trained to coordinate, escalate, and communicate during incidents. Many of these companies run “war rooms” during high-severity incidents. “During a serious incident senior engineers who are not on call may be called anyways,” says Ruth Grace Wong, SRE at Pinterest, “and people may form a ‘war room’ in the form of a video conference or engineers gathering in a physical room.” Netflix’s approach is similar: “Netflix has an incident room on Slack where most incidents are handled,” says Netflix SRE Manager Blake Scrivner, “[and] also has a war room on campus that is used during some larger incidents. A conference bridge is [sometimes] used during off-hour incidents.”
Ticketing systems are very common as part of the coordination process, enabling engineers to track and monitor their work as well as assign tasks to others who can help resolve the incident. The ticketing system is sometimes accompanied by a series of email updates to various teams within the company to update them on the status of the incident; the majority of the companies we surveyed send out email status updates to the entire company for high-severity incidents. The way that ticketing fits into the process is this: all informal debugging and communication is done via chat and video/phone, and any time something needs to be done or something is figured out, a task is created in the ticketing system.
Igor Maravić, Software Engineer at Spotify, explained to us how Spotify uses JIRA as an essential part of their incident response process: “[The] JIRA incident ticket should be constantly populated, during the incident mitigation work, with all of the data, like interesting graphs and CLI error outputs, that might come handy in understanding what happened. Having this data as detailed as possible helps us arrange effective post-mortems. After the JIRA ticket is created a clear update should be sent to the incident mailing group, so all the interested parties are notified on what is happening, what’s being done to mitigate the issue and…the estimate of solving the issue. It’s important for updates to be sent with regular cadence, so people have peace of mind that somebody is working towards resolving the incident.”
Many companies have found the incident commander role to be a key part of their coordination process. Who gets to be the incident commander is different for each company. At Datadog, whoever was the secondary on the on-call shift is promoted to incident commander and is responsible for all communication and coordination around the outage, while at PagerDuty, the incident commander comes from a separate voluntary incident commander rotation containing people in all roles from all areas of the company. PagerDuty has invested a lot in the incident commander role, says Ackerman: “The IC acts as the single source of truth for what is currently happening and what is going to happen during a major incident. The IC has been critical for allowing our engineering team to focus on resolution as quick as possible by owning coordination and communications with other teams and stakeholders. In order to do this under highly stressful conditions effectively, all ICs go through a long, rigorous training process, which includes tactics for managing emotions and not letting them get in the way of incident resolution.”
The goal in the mitigation step is only to mitigate. It is not to discover the root cause, it is not to resolve the root cause, and, importantly, it is not to fix the problem at all. The only goal in the mitigation phase is to work to reduce the impact of the incident, and to make sure that any clients/customers (external and/or internal) are no longer affected by the problem.
By this step, the problem is in the right hands. It’s being addressed by the engineers who have the ability and experience necessary to stop it from affecting more customers, from causing more damage, from costing the company any more money. It’s also being surrounded by extensive documentation and communication: there’s an incident commander updating everyone involved on its status, there’s at least one Slack channel going, there’s a ticket open to document its progress, and, if it’s a serious problem, there’s a video or phone conference call running.
Every company we spoke to has the same mitigation goal: stop the damage as quickly as possible, and don’t waste time trying to find the root cause. Phil Calçado, Engineering Director at DigitalOcean, says that “the main goal is always to get the system back to stability instead of trying to investigate root causes during outages.” PagerDuty’s Sweta Ackerman emphasizes the importance of this: “Our major incident response process focuses around remediating the situation, not fixing it. The idea is that we do whatever it takes to get the system working again.” Ackerman gives an example of how mitigation works in practice: “We do not fix root causes in the middle of an incident if it will take a long time. As an example, if a bad deploy caused a major incident, we instead roll it back to reestablish a healthy state, not necessarily fix the root cause. The root cause analysis is an important part of our post-mortem step, which comes later in the process.”
Datadog CTO Alexis Lê-Quôc points out that engineers often are tempted to root-cause: “It heavily depends on the nature of the outage of course but after initial diagnostic, the goal of the responder(s) is to mitigate the incident and not investigate the root cause, which, being curious by nature, is always a tempting option.” He also says that “It is the role of the incident owner to make sure the priority is on mitigation. If enough people are involved, they can form a parallel team to investigate in parallel.”
While the mitigation goals and approach are the same at these companies, the mitigation strategies are completely unique–a fact that isn’t all too surprising when considering that mitigation steps are almost always system-dependent and system-specific. Andrew Fong, Director of Engineering at Dropbox, points out that incidents will be different for different parts of the Dropbox ecosystem, and so Dropbox has had to build special programs that allow the systems to degrade rather than experience outages: “[Depending] on the software, [the mitigation strategy] could range from bug fix to failing over facilities. In general we try to have escape hatches that allow us to degrade vs go down…In a world where you support web, mobile, and desktop applications which can fail at any point you need to have processes and technology which allow for escape hatches at each point.”
Netflix has perfected its mitigation strategies by running chaos testing in production, constantly breaking their services on purpose to make them more fault-tolerant. Chaos testing in production is a common cause of incidents at Netflix, and they’ve honed their tools carefully over the years so that the chaos tests that bring Netflix down also make it stronger and make problems easier to mitigate. “Netflix has developed and open sourced Hystrix which is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable,” says Blake Scrivner. “This allows for most microservices to go down without impacting key customer experiences. Netflix also developed Chaos Kong, which allows us to evacuate a sick region quickly and lessen the impact.”
Slack, on the other hand, finds that the majority of their incidents are caused by problems introduced by new deployments/changes to the system, and that consequently their most effective mitigation strategy is to roll back to a known stable build or known system configuration. “Many incidents are correlated to a change released in one of our systems,” says Richard Crowley. “In these cases the most common response is reverting a change or rolling back to a known good configuration. Incidents caused by outside factors often require a unique response that varies across incidents.” Slack has also found that their biggest assets in mitigation are their provisioning and re-provisioning tools, which allow them to quickly mitigate “everything from hardware failures to Availability Zone failures to sudden and unpredicted traffic increases”. They’ve also built custom tooling “to migrate WebSocket connections for teams and channels around to balance load, isolate failures, test new versions of code,” and they’ve had to implement “cache coherency tests and variously fine- and coarse-grained cache invalidation tools.”
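The rollback heuristic Crowley describes, correlating an incident with a recent change and reverting it, might look something like this in miniature. The function, the deploy-record shape, and the time window are all assumptions made for illustration:

```python
# Sketch of a "correlate with a recent deploy, then roll back" check.
# All names, shapes, and the 30-minute window are invented.

def pick_mitigation(incident_start, recent_deploys, window_s=1800):
    """Return a rollback target if a deploy landed just before the incident."""
    for deploy in reversed(recent_deploys):  # newest deploy last in the list
        if 0 <= incident_start - deploy["at"] <= window_s:
            return ("rollback", deploy["previous_build"])
    return ("investigate", None)  # no recent change: outside factors

deploys = [{"at": 1000, "previous_build": "v41"},
           {"at": 5000, "previous_build": "v42"}]
print(pick_mitigation(5600, deploys))  # ('rollback', 'v42')
```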
Once an incident has reached the resolution step, it is hopefully no longer impacting any external users. Sometimes, however, mitigation only stops the incident in its tracks, preventing it from further impacting external users. This latter situation tends to be the most common at many companies, and in both cases the incident is not yet resolved, only mitigated, and the root cause of the problem is still unknown. Many companies, like Google, have specific time constraints that teams must take into account when mitigating and resolving incidents: some allow the mitigation phase for high-severity incidents to last no longer than five minutes, others stretch it to thirty minutes, and much smaller companies tend to push it to several hours. Once the mitigation phase is over, the clock starts ticking for resolution, and many companies carefully measure both the mean time to mitigation (MTTM) and the mean time to resolution (MTTR). Time is of the essence.
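MTTM and MTTR are simple to compute once incident timestamps are tracked. A minimal sketch, assuming a hypothetical incident record with acknowledgment, mitigation, and resolution times (real incident trackers store far more):

```python
# Computing mean time to mitigation (MTTM) and mean time to
# resolution (MTTR) from a list of incident records. The Incident
# shape is an assumption, not any particular tool's schema.

from dataclasses import dataclass
from statistics import mean

@dataclass
class Incident:
    acknowledged_at: float  # epoch seconds when the page was acked
    mitigated_at: float     # when impact on customers stopped
    resolved_at: float      # when the root-cause fix was deployed

def mttm(incidents):
    """Mean time to mitigation: acknowledgment -> mitigated."""
    return mean(i.mitigated_at - i.acknowledged_at for i in incidents)

def mttr(incidents):
    """Mean time to resolution: acknowledgment -> resolved."""
    return mean(i.resolved_at - i.acknowledged_at for i in incidents)

history = [Incident(0, 300, 7200), Incident(0, 900, 3600)]
print(mttm(history), mttr(history))
```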
It is at this step in the incident response phase, and this step alone, when the engineers can finally investigate the root cause and try to fix the problem. The process here is simple, but it has to be undertaken with great care; specifically, fixes that resolve the incident should be treated with the same caution and care as normal feature deployments, and hotfixes that bypass good development, testing, and deployment practices should be avoided. The root cause should be identified, which often takes the majority of the time. After that, a fix for the root cause needs to be developed, tested extensively, and then revised and retested until the engineers are certain that the fix will not introduce any new problems. Once the engineers are confident in their fix, they can deploy it carefully to production.
The resolution phase is an almost entirely manual process. The engineers at some of the companies we spoke with have no way of knowing that incidents are resolved (and even in some cases, mitigated) without carefully watching graphs in dashboards and combing through logs. “When do you know an incident is resolved?” we asked several companies. “When the indicator that triggered the alert has returned to normal or stabilized,” replied Airbnb. “Data, primarily through dashboards and alerting,” said Amazon. “We can usually validate the fix with our support staff, as they are always in contact with customers who were experiencing any issues,” replied DigitalOcean. “Graphs typically,” said Dropbox. “[It’s up] to the IC,” replied New Relic. “We look at graphs and make sure they have gone back to normal,” said Pinterest. “We tend to compare the current [WebSocket] count to the count right before the incident began, to the same time yesterday, and to the same time last week,” replied Slack. “We determine resolution through a combination of our monitoring sources, manual verification, and customer feedback via the support team,” said Shopify. “Incidents are declared all-clear once the service goes back in pre-incident state,” replied Spotify.
Blake Scrivner at Netflix gave us some insight into how Netflix thinks about incident resolution. “We look at two primary features when thinking about incidents – stabilization and resolution,” he said. “We consider an incident stabilized when our customers are able to use the service again. We may have some duct tape and bailing wire in place at the time, but our customers are operating. Once we’ve done the work to ensure that we won’t fail this way again and we’ve documented the outage, we consider things resolved.” Given Netflix’s dedication to fault-tolerance, making sure the same problem won’t happen twice is extremely important. Incident resolution doesn’t happen at Netflix, he says, until “there is a reasonable amount of confidence amongst all involved parties the issue will not recur in the near future (or ever).”
New Relic, like Netflix, is determined not to let incidents happen twice. “We also have a Don’t Repeat Incidents (DRI) policy that dictates a lot of our incident follow-up,” says VP of Engineering Matthew Flaming. “The DRI policy, which all teams follow, states that for all incidents that cause an SLA violation, all merges to master for that team are halted, except changes directly related to fixing the root cause of the incident. Of course, common sense applies: in cases where a fix is infeasible, reasonable steps to reduce the likelihood of a repeat incident, or the severity of impact if the issue does recur, will suffice. In other words, we don’t let teams move on and deploy other things until they’ve taken steps to fix their broken windows or risks that have led to an incident. We’ve found that being explicit about this is very useful because it means engineering teams don’t have to worry about where to prioritize work to address reliability risks vs. other things – the right answer is baked into the system.”
After resolution, the system should be back to a normal, stable, reliable state. Even though the incident has been resolved by this point, it is not quite over. The last step in the incident response process is perhaps the most important: it’s when and where the incident is examined, discussed, and learned from. There are a few distinct parts of this last step that industry leaders follow: they write blameless postmortems, they make and assign follow-up tasks, and they hold incident review meetings where all high-severity incidents are reviewed and discussed. Importantly, incidents are never considered over until all follow-up tasks have been completed.
Postmortems are documents that detail the who, what, where, when, and why of incidents after they occur. A good postmortem will include the following information: a short description of the incident, an evaluation of its severity and impact, the timeline of the incident, an account of how it was triaged and mitigated and eventually resolved, a list of things that went well during the incident response, a list of things that went poorly during the incident response, and a list of follow-up tasks (usually things to clean up after the incident). Postmortems are almost always written by the engineers who were on-call when the incident began, but can be written by a combination of the triage, mitigation, and resolution teams if there were multiple teams involved. The information contained in a postmortem is pulled from the communication channels that were used: the chat logs, recordings of the conference calls or videos, and the tickets that were made to track the incident. Good postmortems are blameless, meaning that they don’t point fingers or assign blame to anyone involved, even in cases where an engineer clearly messed something up; many postmortems do not even mention names of the engineers involved.
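The postmortem contents listed above can be captured in a simple template. The field names below are assumptions made for illustration; every company structures its postmortem documents differently:

```python
# A minimal postmortem template covering the sections the article
# describes. The field names are invented, not any company's format.

def new_postmortem(title, severity, impact):
    """Return an empty postmortem document as a plain dict."""
    return {
        "title": title,
        "severity": severity,          # e.g. "sev-1"
        "impact": impact,              # who and what was affected
        "timeline": [],                # (timestamp, event) pairs
        "triage_and_mitigation": "",   # how the incident was stopped
        "resolution": "",              # root cause and the fix
        "went_well": [],
        "went_poorly": [],
        "follow_up_tasks": [],         # incident isn't over until done
        # Blameless: no engineer names anywhere in the document.
    }

pm = new_postmortem("API outage", "sev-1", "5% of API requests failed")
pm["timeline"].append(("14:02", "pager triggered"))
```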
When their engineers were writing postmortems, Airbnb found it was very difficult to keep track of what was happening during an incident even with all of the centralized communication channels, so they built their own custom tooling to fix this problem. “We built our own incident tracker tool,” says Engineering Manager Joey Parsons, “to track important metrics about an incident (how long it lasted, how it affected users, tagging/categorization).” At New Relic, says VP of Engineering Matthew Flaming, engineers ran into the same problem and also built a custom tool: “A key part of our incident response and follow-up process is tracking the incident impact, scope, and remediation in a centralized tool so we can do analysis on things like common root causes. For each incident we also identify follow-up actions – either cleanup or remediation – that are tracked in the same system. These follow-ups usually surface as part of the incident retrospective, but can also come from the input of other stakeholders or teams.”
Datadog CTO Alexis Lê-Quôc says that they’ve learned to track various key things about their postmortems over the years. “We keep a running tally of postmortems in a simple spreadsheet to identify trends on various axes: (i) Technical cause: one of lack of testing, untested failure, deployment failure, etc., (ii) Philosophical qualms about the existence of true root causes, the goal is balance a deep but time-consuming analysis (e.g. Challenger postmortem) with actionable insights to direct the teams’ immediate focus, (iii) Component where the incident/outage started, (iv) Whether the incident/outage is a re-occurrence or not.” He says that this analysis of postmortems has made quite a difference. PagerDuty Engineering Manager Sweta Ackerman says that PagerDuty has instituted something very similar: “PagerDuty maintains an archive of all post-mortems, which has the nice side effect of a deep and rich incident management history. Keeping this archive also allows anyone, before a new incident starts, to review and see if this is a new problem, one we had before (hopefully not), ideas for quick resolution, and whether we are learning from our previous issues.”
When it comes time to review postmortems for high-severity incidents, some companies are simply too large to review all of the incidents in the review meetings. To combat this problem, AWS does a “pick a service” review in their meetings using a “chore wheel”: once a week, their engineers tell us, management holds an operations review where they spin their chore wheel and randomly pick a service in AWS to review; the process keeps all of the teams on their toes and is incredibly effective, engineers at AWS say. At Datadog, the CTO picks the most interesting postmortems from the month and holds a monthly review meeting, using their postmortem analysis spreadsheet to pick out interesting and relevant aspects of the incidents.
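The AWS “chore wheel” amounts to a uniform random draw over services. A toy sketch with made-up service names:

```python
# Spinning a "chore wheel": pick one service uniformly at random
# for this week's operations review. Service names are illustrative.

import random

services = ["storage", "compute", "queueing", "identity", "billing"]

def spin_chore_wheel(candidates, rng=random):
    """Pick one service at random for the weekly review."""
    return rng.choice(candidates)

print(spin_chore_wheel(services))
```

The `rng` parameter just makes the draw seedable for testing; the point is that every team has an equal chance of being reviewed each week, which is what keeps them all on their toes.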
Some smaller companies are able to take a more democratic approach. Spotify, for example, lets the on-call engineers take the lead. “The on-call engineer who responded first to the incident will organise a post-mortem for all the parties involved,” says Igor Maravić. “During post-mortems we analyse the incident timeline and through discussion we come up with the possible remediations. It’s up to the team to decide how the remediations are going to be prioritised.”
PagerDuty runs review meetings for all high-severity incidents, and Sweta Ackerman gave us more details on how PagerDuty structures this very last step in the incident review process: “PagerDuty runs a post-mortem for all major incidents (for us, this is severity one and two incidents). Additionally, individual teams are encouraged to run post-mortems for any issues they feel they need to understand better. [W]e have post-mortems after the incident is resolved to follow up on action items, perform root cause analysis, and discuss improvements to our processes and systems. For major incidents, our post-mortems are also made public so we can share our learnings with our customers and industry at-large.”