Once upon a time, I was on call every three weeks for a rotation during which I was reliably paged at 9 p.m. each Tuesday through Thursday night. New deployments were done at this time, and with the company still using some legacy tooling, something almost invariably went wrong. Often, the fix was simple: Restart the tool, rerun a script, and—voilà!—things would work.
But one fateful night, we ended up with about 100 people on the incident call before the fix was made, and everyone went to bed at 2 a.m. It was an engineer’s worst nightmare, a once-in-a-blue-moon fluke… Until the next night, when it happened again—and then again the night after that. The cycle continued for months, and we got so buried in dealing with the incidents themselves that we could barely pause to reflect on what we could have improved or done differently to prevent these issues.
Unfortunately, that on-call experience mirrors that of many other engineers at companies of all sizes, from startup shops to large enterprises. I am lucky enough to now work for PagerDuty, a company that not only helps to reduce these pain points, but is also a thought leader in digital operations and major incident response. At PagerDuty, a critical component of the incident response process is the learning and follow-up phase. It’s also one of my favorite parts of the process: the time when everyone gets together to reflect and have conversations about how to improve both the incident response process and the technical services and infrastructure.
Incident response also yields some essential documentation. One avenue for driving continuous improvement is through the post-mortem process. Post-mortems aren’t just meetings—they’re also documents that detail the Five Ws (who, what, where, when, and why) of an incident and help teams to garner actionable insights on how to make improvements. If done well, a post-mortem can be a powerful tool for both current and future teams.
A good post-mortem process is broken down into three major parts, the first of which will usually take up the bulk of your time:
Writing a post-mortem.
Reviewing and publishing the post-mortem.
Tracking the post-mortem.
Let’s go through each step in more detail.
Writing the post-mortem
The main goal of writing a post-mortem is to capture the timeline of events and the impact of an incident so that it can be presented in a subsequent review meeting. Fittingly enough, I’m a big fan of PagerDuty’s own post-mortem tool. Alternatively, a simple wiki template works, as long as it’s easy to create and captures all of the fields listed here. It’s also important to capture and save all post-mortems in a searchable place. (More on this later.)
Some key highlights to include in the post-mortem are:
The timeline: This will constitute the majority of the post-mortem. Start by including important changes in incident status or impact to customers and any major actions taken by responders, engineers, or subject matter experts. Additionally, for each item, include a data source or metric (such as a Datadog graph or tweets showing customer impact).
Analysis: A simple summary of what happened. This should capture the underlying cause of the incident, how many customers were affected, and the overall impact on customers (e.g., what functionality was degraded or affected).
Action items: List the actions that were identified and undertaken during the incident, as well as any necessary follow-up tasks. These action items should be captured in the post-mortem so that they can be assigned later on.
External messaging: Assuming this was a major incident, draft the external messaging to customers, recapping some of the details above.
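If you go the wiki-template route, it can help to think of the fields above as a small data structure. What follows is only a minimal sketch in Python, not PagerDuty’s tool or its data model: the class names, field names, and the two helper methods are all hypothetical, and the sample date is made up. It shows one way to flag a draft that is missing a section or has timeline items without a data source.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One status change or responder action (hypothetical structure)."""
    at: datetime
    description: str
    evidence: str = ""  # link to a graph, metric, or customer report


@dataclass
class ActionItem:
    description: str
    owner: str = ""  # assigned later, in the review meeting
    eta: str = ""    # agreed later, in the review meeting


@dataclass
class PostMortem:
    title: str
    analysis: str  # what happened, underlying cause, customer impact
    timeline: list[TimelineEntry] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)
    external_message: str = ""  # drafted only for major incidents

    def missing_sections(self) -> list[str]:
        """Return the names of required sections that are still empty."""
        missing = []
        if not self.timeline:
            missing.append("timeline")
        if not self.analysis.strip():
            missing.append("analysis")
        if not self.action_items:
            missing.append("action_items")
        return missing

    def entries_without_evidence(self) -> list[TimelineEntry]:
        """Return timeline entries that have no data source attached."""
        return [entry for entry in self.timeline if not entry.evidence.strip()]


# Sample draft with made-up values: one timeline entry, nothing else yet.
draft = PostMortem(title="Tuesday deploy failure", analysis="")
draft.timeline.append(TimelineEntry(datetime(2024, 1, 2, 21, 0), "Deploy started"))
print(draft.missing_sections())
print(len(draft.entries_without_evidence()))
```

A check like this is optional, but running it before sending the draft out means reviewers spend their time on the content instead of pointing out empty sections.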
Reviewing the post-mortem
Once you’ve filled out the post-mortem template, send it out to all parties ahead of the post-mortem meeting. Key stakeholders to invite to the meeting include the Incident Commander (IC) and any ICs-in-training; technical service owners; key responders, engineers, or subject matter experts involved in the incident response; and, for major incidents, a customer liaison. Invite all members to leave comments or make edits to the report, especially to the timeline portion.
If everyone has had a chance to review and edit the post-mortem timeline ahead of time, the post-mortem review meeting itself should only take about 30 minutes. However, you may prefer an hourlong meeting for longer or larger incidents. Regardless of length, the post-mortem review meeting should focus on the following:
Alignment on the timeline. Quickly recap and review the timeline and ensure that everyone is on the same page.
Discussion of how the problem could have been caught. Capture any new action items along the way.
Discussion of customer impact and the external messaging, if needed.
Review and assignment of action items, along with ETAs.
Publishing the post-mortem
Once you’ve completed the post-mortem review meeting, there’s one final but important step you have to take: publishing the post-mortem. Distribute the post-mortem as an internal communication, typically via email, to all relevant stakeholders, describing the results and key learnings and providing a link to the full report.
Tracking the post-mortem
After some months of having a well-structured post-mortem process in place, you may find yourself with a list of post-mortem documents, ideally tracked in a wiki or another searchable tool. Why does this matter, and how does it help?
There are many benefits to having a detailed, searchable collection of post-mortems:
A list of post-mortems serves as a major incident log that can be used to inform future incident response. The next time you are in the heat of a major incident, the information you need may not be at hand. Having an easily searchable record of past incidents allows you to quickly look at similar cases and even reference specific graphs or data points. No more digging for old information in new places.
Post-mortems can help align the whole business by providing everyone access to the same information about an incident—a benefit no matter the size of the company. Once the post-mortem is published, the information within it can be used by many departments for a variety of purposes. For example, Sales can consult post-mortems when customers or prospects ask them about a past incident; having a log of these incidents will put the key messaging and details at the Sales team’s fingertips. Or, Finance can consult a post-mortem to evaluate the impact to the customer in case credits need to be issued for a service degradation. And so on.
Post-mortems provide a business case for technical reinvestment. Having a rich post-mortem log allows engineering team leads to more easily inspect which parts of the technical architecture might need some reinvestment. A pattern of similar or repeated incidents with the same underlying root cause can point to the need for larger architectural changes. Post-mortems contain all of the data an engineering manager needs to help get buy-in and alignment from Product counterparts, as well as other teams that may need to spend time working on fixing issues in the longer term. Post-mortems are a great way to bring awareness to these issues and quantify them in business-speak.
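To make the first and last of these benefits concrete, here is a rough sketch in Python of what "searchable" and "a pattern of repeated incidents" could look like over a post-mortem log. Everything in it is an assumption for illustration: the PostMortemSummary record, its root_cause field, the two functions, and the sample entries are hypothetical, and a real wiki or incident tool would have its own search. It also assumes root causes are recorded consistently enough to be compared as text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PostMortemSummary:
    """A trimmed-down, hypothetical record of one published post-mortem."""
    title: str
    root_cause: str
    analysis: str


def search(log, term):
    """Case-insensitive substring search over title, root cause, and analysis."""
    needle = term.lower()
    return [
        pm for pm in log
        if needle in f"{pm.title} {pm.root_cause} {pm.analysis}".lower()
    ]


def recurring_root_causes(log, min_count=2):
    """Return (root cause, count) pairs seen at least min_count times.

    Root causes are compared after trimming whitespace and lowercasing,
    so "Legacy deploy tool" and "legacy deploy tool " count as the same.
    """
    counts = Counter(pm.root_cause.strip().lower() for pm in log)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


# Made-up sample log for illustration only.
log = [
    PostMortemSummary("Tuesday deploy failure", "Legacy deploy tool", "Script had to be rerun"),
    PostMortemSummary("Wednesday deploy failure", "legacy deploy tool", "Tool was restarted"),
    PostMortemSummary("Database failover", "Expired certificate", "Replica was promoted"),
]

for match in search(log, "deploy"):
    print(match.title)
print(recurring_root_causes(log))
```

Even a count this crude gives an engineering manager something to point at: the same root cause showing up week after week is a stronger argument for reinvestment than any single incident.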
While it may seem like the creation of post-mortems as documentation takes a lot of time and investment, in reality the effort is quite minimal compared to the time and money that are lost when companies remain mired in major tech debt or disorganized incident response processes. I look back on my time spent putting out fires on call, and I think about how different things would have been if we had recognized the value of post-mortem documentation sooner. We could have saved many hours of lost sleep, fostered a culture of continuous learning, delivered better software, and saved customers a whole lot of pain.
If you find yourself in a similar situation, you don’t have to reinvent the wheel. Take advantage of all of the great work that has already been done by people who have lived through this. It will really help you to set yourself, your team, and your company up for success.