Your sales database just went down. Your source control is in a crash loop and a thousand software engineers are reading Twitter while they wait for it to come back online. You pushed a change that’s routing all of your North American traffic through Azerbaijan. Whatever just happened, you’ve got an incident on your hands. What happens now?
The beginning of an incident can be chaotic, especially when it involves multiple teams. If responders aren’t used to working together, they’re likely to talk over and past each other as they independently discover the same spiky metric or smoking log line. Suddenly the graphs show a spike followed by a dip, and the investigation takes a sharp turn—until the logs reveal someone has tried to restart the service. This well-intentioned but misguided attempt to help deleted a log another responder was running a debugging tool against—which means they’re going to have to restart their analysis.
A team member offers an insight that could shave 30 minutes off the outage, but they do so in a Slack thread nobody is paying attention to. (Later, their “I told you so” will not win them any friends.) Meanwhile, engineers are calling out facts using terminology only their team understands; scared of seeming incompetent, nobody asks for clarification. Stakeholders are asking increasingly strained questions about when the situation might be resolved. The engineers who best understand the system are too busy replying to focus on debugging.
And the incident is still going on.
If this sounds familiar, it’s because most growing organizations go through a phase of this kind of chaos. Incident response might only be a small part of any company’s reliability strategy, but it’s still important to commit to getting it right: Sloppy responses can prolong your outages, damage your credibility, demoralize your teams, and erode your customers’ trust. In the worst cases, the response can cause more disruption than the problem you’re setting out to solve.
Great incident response alone isn’t a panacea—you need to invest in resilient systems to reduce the number and severity of outages—but once an incident is underway, human behavior becomes a crucial component of those systems. The way responders react plays a big role in determining how quickly and smoothly the situation is resolved.
In 1970, Southern California had one of its worst wildfire seasons on record. Despite rapid response from the U.S. Forest Service, multiple fire departments, and other agencies, the damage was devastating. After the fires, the Forest Service worked with the other institutions to identify some of the weak links in their coordinated response. These included agencies planning and operating independently, a lack of shared terminology, hierarchies that made it difficult for teams to cooperate and communicate effectively, and siloed information and resources, which led to logistical messes where fire trucks from different departments passed each other en route to fires the other department could have reached sooner.
The fire protection agencies set out to “make a quantum jump” (in the words of the Forest Service’s 1973 “FIRESCOPE Program Charter”) in their ability to coordinate and allocate resources during fires. Their scope later expanded to encompass all risks and hazards, and in 1974 the Incident Command System (ICS) was born as a simple but effective structure that could handle any kind of major incident.
While the ICS is a complex system that most of us in tech will never use in full—it includes defined locations, templates for assigning radio frequencies, frameworks for setting objectives, and a hierarchical chain of command with 36 separate named roles—many software organizations have adopted parts of it for managing their own incidents. In particular, they’ve found value in having a dedicated incident commander (IC—not to be confused with that other common IC, individual contributor), someone who takes charge of the situation instead of fighting the fire.
While tech outages (usually!) aren’t literal fires, they too benefit from having a responder focused on leading the incident rather than debugging it.
By setting up a chain of command and laying down rules, the IC provides structure for the incident response. Everyone knows who to talk to for the most up-to-date information, as well as who gets the deciding vote when there are disagreements about how to move forward. When a situation is murky, the IC is expected to ask questions, clarify jargon, and make sure responders understand each other.
Staying focused on the user
Nominating an IC means there’s at least one person who won’t get distracted by the concerns of any particular team. While other responders may get wrapped up in solving technology problems, the IC should keep the bigger picture in mind, advocating for whichever solutions or mitigations reduce user impact and shorten the incident.
Supplying status updates
The IC tracks and documents the current customer impact, leads being explored, reminders to return to later, points of contact, decisions that still need to be made, and any other high-level concerns. They’re often responsible for making sure customers and stakeholders get regular updates about what’s going on, as well as fielding interruptions from people who aren’t involved in getting the systems back online. If this ends up being too time-consuming, they might delegate these responsibilities to a communications lead.
Coordinating with others
Responders can make a situation worse by not working together. The IC makes sure the various actions being taken aren’t in conflict. They have the authority to shut down what fire commanders refer to as “freelancers”—bystanders or firefighters from neighboring departments who endanger other responders by joining the fight without coordinating first.
The IC can set the rules of engagement for communication channels, such as moving all debugging to a central channel or redirecting side conversations elsewhere. They can also designate a single operations lead for each team and ask all other members to send their updates to that lead, who then reports back to the IC. This reduces the noise in the channel and makes it less likely that important messages will be lost.
Taking care of responders
Most people aren’t good at collaborating or interpreting information when they’re tired. The IC should be alert for fatigue, crankiness, or conflict and intervene when needed, including finding new folks to take over if responders (including themselves) have been at it for too long.
For incident command to be effective, it needs to be a consistent process that’s well understood throughout the organization, with clear leadership support and encouragement. ICs should feel confident that if they step up, everyone else will understand what they’re doing and why. When introducing an incident command process to an organization, be clear about how you expect it to work.
Decide which incidents merit an IC
This will depend on the organization, but qualifying incidents might include user-visible outages, issues that affect more than one team or have many stakeholders, or issues that span an extended time frame and require consistent, coordinated communication.
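As a purely illustrative sketch, criteria like these can be written down as an explicit paging rule so the decision isn’t made ad hoc mid-incident. The fields, names, and thresholds below are hypothetical, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    """Minimal incident record; fields are illustrative only."""
    user_visible: bool
    teams_involved: set[str] = field(default_factory=set)
    duration_minutes: int = 0

def needs_incident_commander(incident: Incident) -> bool:
    """Hypothetical policy: call in an IC for user-visible outages,
    incidents spanning multiple teams, or anything running past an hour."""
    return (
        incident.user_visible
        or len(incident.teams_involved) > 1
        or incident.duration_minutes > 60
    )
```

The point of encoding the policy isn’t automation for its own sake; it’s that responders shouldn’t have to debate whether an incident “counts” while it’s ongoing.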
Be clear about who your ICs will be
This might be a dedicated IC on-call rotation, a pool of trained volunteers, or an expectation that everyone above some level of seniority will be prepared to step in when needed.
Make sure everyone knows how to work with an IC
There should be a well-documented process for when and how to call in an incident commander. Everyone should be clear on the responsibilities of the role and the scope of the IC’s authority.
Most of all, having an incident commander only works if everyone believes in the role. Someone stepping in to address a crisis and saying “I’m Batman” doesn’t help unless people have bought into the idea of Batman. Otherwise, it’s just one more person in the room (maybe in a ridiculous costume) adding to the noise. The IC needs enough authority to move a team’s communication to another channel, send bystanders on a side quest to collect log output, or instruct “freelancers” to stop trying things. Their ability to coordinate the response works only if other responders engage with the process and with them.
Like all new processes, incident command will require some culture change, and perhaps a healthy amount of getting it wrong before you start getting it right. You can establish expectations through training, documentation, presentations at all-hands events, and emails to the company, while disaster simulations can help ensure everyone is well-practiced at calling in an IC and following their lead. Incidents should be visible to all, perhaps in a dedicated incident channel, and incident retrospectives should (blamelessly!) include specific failures in the process, such as information the IC couldn’t find or responders who didn’t check in.
To be successful at incident command, you need two foundational skills: being comfortable telling people what to do, and being comfortable asking questions—even questions that might seem “obvious.” These skills tend to be found in more senior people, but anyone can take on the role if they’re willing to take charge. For those who aren’t quite ready, coordinating smaller incidents can help build that comfort. It’s good leadership experience—not to mention a great way to learn more about your systems end to end.
The most important aspect of the incident commander’s role is to be clear and consistent about what you need people to do, and to make sure they see others doing it. The more bystanders see someone say, “I’ll be the incident commander,” the more it becomes the cultural norm. Over time, responders will start expecting an incident commander to step up, and will call for one when an incident breaks out.
So step up. When you’re surrounded by sirens and flames and chaos, resist the urge to be just another person fighting the fire. Declare yourself the incident commander and take charge.