How do people—from ER doctors to air traffic controllers to software engineers—manage to keep complex systems working in the face of continuous challenges? This is the central question of resilience engineering.
Studies of resilience tend to focus on what resilience is and how it works. In contrast, resilience engineering seeks to enhance the resilience already present in a system. Since its emergence in the early 2000s, researchers have collaborated across fields such as human factors engineering and cognitive systems engineering to understand how complex work in hazardous domains across many different industries can succeed. And it’s been making significant inroads into the tech industry.
Since 2013, a growing number of companies have joined the resilience engineering community, in large part thanks to the SNAFUcatchers Consortium, which brought researchers from Ohio State University into the offices (and lunch rooms) of some of tech’s biggest names, including IBM, Salesforce, and Etsy. (Two of this article’s authors, Dr. Richard Cook and John Allspaw, are core members of the team. Dr. David D. Woods, interviewed elsewhere in this issue, is also a team member.) The collaboration paired university researchers with partner companies to examine, compare, and contrast their experiences anticipating and handling incidents in order to deepen their understanding of these critical events.
One result has been a burgeoning understanding of adaptive capacity—a person or organization’s capacity to adapt when circumstances change, such as during a major incident, a string of incidents, or organizational shifts—as a hallmark of resilience. This is explored in “Building and Revising Adaptive Capacity Sharing for Technical Incident Response: A Case of Resilience Engineering,” a case study by Dr. Cook and Beth Long published in Applied Ergonomics in January 2021.
This article, a summary and expansion of Cook and Long’s findings, will examine one company’s approach to incident response as a framework for understanding adaptive capacity, and highlight takeaways for organizations looking to adopt a resilience engineering perspective.
In their paper, Cook and Long detail the case of consortium researchers and engineers from a participating company who met to discuss a recent barrage of incidents that had taxed the company’s incident handling capacity and contributed to burnout among responding engineers. This wave of incidents made clear that their established incident response processes weren’t working. Development teams each had their own on-call engineer responsible for handling incidents that affected local components and subsystems, but this strategy fell short in the face of multiple complex, overlapping incidents that were difficult to diagnose and resolve.
Recognizing the need for a new approach, a group of experienced engineers established a support cadre to help respond to high-severity or difficult-to-resolve incidents, serving as a deep technical resource that could be tapped to support incident response. (Successfully sharing adaptive capacity in this way, Cook and Long noted, depends on specific characteristics such as the rate of incidents—not too low, not too high—as well as their duration—minutes to hours—and their magnitude—a combination of minor and major.)
The group, which initially included eight engineers and engineering managers from various teams, self-organized an on-call rotation so anyone at the company could summon them to provide their engineering and operational expertise. An on-call support engineer would participate in incident response if an incident crossed certain thresholds, such as high customer impact or long duration, allowing the other members of the incident support group to concentrate on their own tasks when not on call. The group members met weekly to review recent incidents and discuss how they should adjust their approach.
Members were aware that their participation in the group took time away from their primary work, and they tried to build in backstops to avoid conflicts. For example, a team of five engineers with one member in the incident support group could expect that engineer to be focused on incidents about one week out of eight. This support engineer would schedule work for their on-call week that was both interruptible and less taxing than that of their teammates. The workload of the engineer’s “home team,” however, stayed the same; the on-call support engineer’s decreased productivity was treated as overhead.
In theory, the on-call support engineer would participate in incidents only occasionally. In practice, however, they usually monitored all incidents’ progress, effectively staying on “hot standby.” Being alert to active incidents meant they could come up to speed faster if and when one required their expertise.
The organization quickly benefited from this new process. For one thing, some incidents were resolved faster: Bringing their expertise and diverse incident experience, on-call support engineers were able to help first responders identify and resolve problems more efficiently. For another, the incident support group relieved some of the strain of managing incidents with severe consequences: First-responder engineers knew a specific person would appear when an incident was severe or long-running, which helped lessen their anxieties.
Lastly, this approach reduced the “fire alarm” effect. Previously, a serious incident might capture the attention and efforts of many senior engineers, disrupting work across teams. Now, non-responders could safely stay focused on their own tasks, knowing an incident support group member was on call and would engage if needed. Group members who were not on call, meanwhile, could focus on their home team’s work for seven out of eight weeks.
Over the course of their first year in practice, the incident support group’s members continued to refine their processes. Initially, an incident commander would summon an on-call support engineer on an as-needed basis. But the group began to notice that responding engineers either delayed or avoided calling for help during long-lasting or severe incidents. The reasons varied: Some engineers got caught up in problem-solving and didn’t think to bring on extra support, while others were overconfident in their ability to resolve the incident without it.
In response, a group member wrote a program to track in-progress incidents and page an on-call support engineer under certain conditions, such as a particular duration or declared severity. This offloaded first responders’ request-for-help burden, and in some cases allowed the on-call support engineers to learn about incidents independently.
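The paper doesn’t describe the program itself, but the logic it implies is straightforward. A minimal sketch of such an escalation check might look like the following, where the specific thresholds, the severity scale, and the paging callback are all hypothetical placeholders, not details from the case study:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds; the actual values at the company are not documented.
DURATION_THRESHOLD = timedelta(minutes=45)
SEVERITY_THRESHOLD = 2  # severities 1-2 (1 = most severe) trigger support

@dataclass
class Incident:
    id: str
    severity: int
    started_at: datetime
    support_paged: bool = False

def needs_support(incident: Incident, now: datetime) -> bool:
    """True if an unresolved incident has crossed a paging threshold."""
    too_long = now - incident.started_at >= DURATION_THRESHOLD
    too_severe = incident.severity <= SEVERITY_THRESHOLD
    return not incident.support_paged and (too_long or too_severe)

def sweep(incidents, now, page):
    """Check in-progress incidents; page on-call support at most once each."""
    for inc in incidents:
        if needs_support(inc, now):
            page(inc)  # e.g., a call into the paging provider's API
            inc.support_paged = True
```

Tracking whether support has already been paged is what lets the sweep run repeatedly without re-interrupting the on-call engineer for the same incident.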
Over time, however, members dropped out of the group, which meant the remaining volunteers were on call more often, increasing the burden on their home teams. To lighten the load, the company built a pipeline to replenish the group. It began to recruit new members, offered training, established an apprenticeship program, and sought to make the role more attractive, for example by boosting its status within the organization (including with financial incentives), recognizing participation as part of career advancement, exempting support group members from their home teams’ on-call rotations, and establishing term limits in order to better distribute the workload across the org.
Managing adaptive capacity isn’t easy. It’s not an object that sits on a shelf, waiting for someone to unbox it. It’s ever-present, every day—and ordinary work, especially in an engineering organization, depends on it. It becomes most visible when under threat, its contribution to the system becoming clear at the edges of performance. In this particular case, the impending loss of that capacity motivated company leaders to devise a system to share it more easily and reliably.
The creation and continued evolution of this incident response approach can offer some critical learnings for other organizations. First, the company recognized the need to build and maintain a reserve of expertise it could summon in situations where “normal” incident handling practices were faltering. The company explicitly supported this reserve, sanctioning it with resources and recognition as a way of preparing to adapt when necessary.
Second, although the group’s value to the organization was clear—virtually no one wanted to return to the old way of handling incidents—building and sustaining this capacity came with significant trade-offs. Home teams whose members joined the incident support group couldn’t count on those engineers’ full attention to local project work. The extra support workload was also inconsistent: Some weeks, the on-call support engineer could work with hardly any interruption, while other weeks they might need to focus almost all of their attention and energy on a string of taxing incidents.
Ultimately, this trade-off of local productivity for more effective incident response was considered worth the cost, so long as that cost was small and evenly distributed. When members began to leave the group, that cost came to be shared among fewer people, with a smaller number of teams taking a greater hit to productivity. Sustaining their adaptive capacity required additional company resources, including the recruitment pipeline and incentives mentioned earlier. In this context, resilience engineering encompassed the organizational restructuring necessary to preserve that adaptive capacity.
Finally, the heterogeneity of the incident support group, comprising members from many different home teams, meant the volunteers learned a great deal about technologies and processes beyond their “home base.” Providing support across teams broadened their expertise as engineers and as colleagues, and rotating group membership with term limits and more aggressive recruiting helped extend this expertise throughout the organization.
The fact that resilience engineering can emerge organically, as it did at this company, suggests we can expect to find it elsewhere in tech—and in other fields, too. As this case study illustrates, the situations that foster it are those that tend to exhaust resilience. Constant demands can erode the adaptive capacity of a system and make resilience hard to sustain. Here, the engineers recognized a problem with their current incident response practices and worked to use their existing adaptive capacity more efficiently.
People working close to the sharp end of a system are often the ones to recognize the erosion of resilience and engineer temporary remedies. Ultimately, though, adaptive capacity has to be nourished and renewed. In this case, the company made an effort to do just that.
Simply adding adaptive capacity to an organization is likely to be difficult. For example, hiring seasoned experts may help, but even the most experienced new hire still needs time to learn about the system before they can offer support when usual problem-solving methods don’t pan out. Companies must husband their adaptive capacity; in this case, the gradual loss of incident support group members was a sign that the company needed to rethink its approach to recruitment and workload.
Notably, the group was mainly self-organized and lacked formal support from the company in its early days. Experts working on the day-to-day maintenance and repair of complex systems are generally more attuned to where adaptive capacity lives and how to extend it than those who work further from these systems. Here, they recognized a problem and took action, efficiently redirecting local resources to make resilience engineering possible. Only later did the company invest in the group more formally, which was likely a good thing: Hierarchical managerial decision-making might have been quite slow, frustrating the fast learning and flexibility that ultimately made the effort successful.
This company’s experience is hardly unique. The tech industry, after all, is awash in incidents. New methods to address them crop up daily, and computational tools to help with incident response and post-incident evaluation are becoming more widespread. But resilience engineering is about more than approaches and tools. It’s about preparing to be unprepared—anticipating and planning for future incidents, detecting when our ability to handle them is threatened, and adjusting our attention and focus as needed.
Being able to keep large, complex, technology-intensive systems running is a primary function of online businesses, and the unpredictable nature of incidents will continue to require expertise that is expensive and hard to develop. Resilience engineering—the effective management of adaptive capacity—will be a crucial tool for organizations seeking to navigate the choppy waters of incident response. We hope this article offers an inlet for those who want to explore it in their own environments.