Technological systems, like society, are becoming ever more complex—with interdependencies that are difficult to trace, operating in environments that inevitably find their edges. Resilience engineering posits that to be dynamic, systems must be able to extend their capabilities gracefully and adapt their capacities when needed.
We sat down with Dr. David D. Woods, who helped found resilience engineering in the early 2000s in response to several NASA accidents, including the Space Shuttle Columbia disaster, for which he was an advising investigator. For 40 years Dr. Woods has worked to improve the safety of complex, high-risk systems in fields such as aviation, nuclear power, and critical care medicine. In this conversation, he explains the concepts behind resilience engineering through the lens of the COVID-19 pandemic and other real-world crises, and how we can build systems that can perform even under stress and surprise.
This interview has been edited and condensed for clarity.
Increment: What’s the difference between reliability and resilience as it relates to complex systems? In my experience, these terms get confused easily and often.
Dr. David D. Woods: Reliability is a record from the past—we can pull out different facets of how we’ve performed in the past and say we’re getting better on this criterion or that. The problem is that [reliability] makes the assumption that the future will be just like the past. That assumption doesn’t hold because there are two facts about this universe that are unavoidable: There are finite resources and things change.
Finite resources and change mean that the future will not be like the past. You need to be poised to adapt, and [you can’t do that] by just trying to invest in reliability. You have to think about robustness and resilience. Robustness then becomes: Can you make the system not just more optimal or productive but able to withstand known risks? If you understand a threat, how do you make the system robust so it will continue to work—or work in a gracefully degrading mode—in the face of that threat?
Now we’re getting to what really is at the heart of resilience, and that’s extensibility. How do you extend performance when an event challenges the way you usually work, challenges your boundaries? Events will arise which stress your system. Those events will find the edges in your plans for normal operation and in your contingency plans, [so] you need to find ways to stretch at those edges. Resilience as extensibility is the opposite of brittle reliability.
We build sources for extensibility to be able to extend performance when we don’t understand what the challenge is. We rely on a lot of cognitive, human, and collaborative mechanisms. NASA Mission Control practiced handling anomalies all the time, from the Apollo era through the space shuttle program. In space, surprise is normal. [They weren’t practicing a] specific failure. They were practicing how to have extensive resilience in the face of an event they hadn’t [anticipated]. They were practicing teamwork.
System failures can arise from a combination of smaller component failures or the environment being complex and unpredictable. And yet there’s this persistent idea that we have to become better at predicting, modeling, and potentially excusing failure if the chances of that failure happening are very low. How do you make the case for true surprises?
The first way to approach surprise is from a reliability and robustness point of view, [in which surprise is] about the way consequences and frequency combine. If it’s low frequency [and the consequences are low], it doesn’t matter. If the consequences are high and it’s low frequency, then we need to do something. For example, [the nuclear power industry] was worried about this in the ’70s given public concern about radiation. [Nuclear accidents] are estimated to be very low-frequency events, but because they can be so catastrophic, you’re going to make a big effort to be prepared to handle them.
It’s a frequency-consequence combination. Everybody assumes that frequency just declines, that we have a normal distribution and the tail is small. But it turns out, in statistics, [we] look at what are called heavy-tailed distributions. We often underestimate what look like low-frequency events. They’re actually much higher frequency than you think because the tails are heavy.
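To put rough numbers on the tail point, here is a minimal sketch that compares how likely it is to exceed the same threshold under a normal model and under a heavy-tailed one. The choice of Student's t with two degrees of freedom as the heavy-tailed stand-in is our assumption, picked only because its tail has a simple closed form; the interview does not name a specific distribution. The threshold x is measured in each distribution's own units, so treat the ratio as illustrative rather than as a calibrated risk estimate.

```python
import math


def normal_tail(x: float) -> float:
    """P(Z > x) for a standard normal variable."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def student_t2_tail(x: float) -> float:
    """P(T > x) for Student's t with 2 degrees of freedom (closed form)."""
    return 0.5 - x / (2.0 * math.sqrt(2.0 + x * x))


def tail_ratio(x: float) -> float:
    """How many times more likely an exceedance of x is under the heavy tail."""
    return student_t2_tail(x) / normal_tail(x)


if __name__ == "__main__":
    for x in (2.0, 3.0, 5.0):
        print(
            f"x={x}: normal={normal_tail(x):.2e}, "
            f"heavy={student_t2_tail(x):.2e}, ratio={tail_ratio(x):.0f}"
        )
```

The further out the threshold, the larger the gap becomes, which is the sense in which a normal assumption underestimates events that look rare.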
An example is Hurricane Harvey in Houston, Texas, a couple of years ago. That was the third year in a row that Houston had a “one-in-500–year” flood event. If you say [what we’re expecting] is a one-in-100–year flood event and you’re looking at a specific geographic location, the frequency data might [suggest that]. But now take space and time averaging. In the continental U.S. this year, how many one-in-100–year flood events will happen? There will be multiple one-in-100–year flood events this year, and that number is increasing. So you have to better prepare.
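The space and time averaging argument can be worked through with simple arithmetic. The sketch below assumes independent locations and independent years, and the figure of 1,000 locations is purely hypothetical; neither assumption comes from the interview, and real flood risk is correlated across places and years.

```python
def expected_events(n_sites: int, return_period_years: float) -> float:
    """Expected number of sites that see the event in a single year."""
    return n_sites / return_period_years


def prob_at_least_one(n_sites: int, return_period_years: float) -> float:
    """Chance that at least one of n independent sites sees the event this year."""
    p = 1.0 / return_period_years
    return 1.0 - (1.0 - p) ** n_sites


def prob_consecutive(years: int, return_period_years: float) -> float:
    """Chance one site sees the event every year for `years` independent years."""
    return (1.0 / return_period_years) ** years


if __name__ == "__main__":
    n_sites = 1000  # hypothetical number of locations being tracked
    print(expected_events(n_sites, 100))    # expected count of 1-in-100 events
    print(prob_at_least_one(n_sites, 100))  # chance of seeing at least one
    print(prob_consecutive(3, 500))         # three 1-in-500 years in a row
```

Under these assumptions, a "one-in-100-year" event somewhere in the country is close to certain in any given year, while three consecutive "one-in-500-year" floods in one city would be so improbable that the more plausible reading is that the estimated frequency, or the assumption that it stays fixed, is wrong.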
We screw frequency up because we get trapped in linear simplifications, and we miss trends of change in the world. And that is not remotely good enough.
The science underlying resilience says there’s a different kind of surprise, and that’s the dominant form we care about: model surprise. In other words, because of finite resources and change, you’re adapting to the world and trying to get a better match between your capabilities and the world you’re in. The possibility for the mismatch to grow, or to move around, is fundamental. You can think of this as an envelope: [Your system is] successful within that envelope, but it has boundaries. The boundaries move. They’re not static.
Model surprise will happen because the world keeps changing. So, very simply stated, viability [of a system] in the long run requires extensibility. The world will throw challenges that find the edges in your current system. If you can’t extend performance at the edges, you’ll end up with a brittle collapse.
Can we look at an example that demonstrates brittle reliability versus graceful extensibility in the way the COVID-19 pandemic was handled?
We saw a classic form of resilient performance extensibility in the early stage [of the pandemic]. With the novelty of the disease, there was a lot of uncertainty about the proper kind of care: Do you put [patients] on ventilators quickly or delay [doing so] even though their oxygen saturation is low? The guidelines before COVID suggested that you respond aggressively to low oxygen saturation in the blood. But [health care workers] had to learn that that could be an over-response. An ad hoc, informal communication network rapidly emerged among physicians trying to develop and understand how to best treat patients. These physicians had a readiness to revise and a readiness to respond.
For an example of brittleness, in [early spring 2020] the CDC was struggling with the novelty [of the disease] and trying to integrate information and send guidance to hospital systems about how to deal with [it]. But what happened? The CDC was sending updated guidance to hospitals multiple times a day. The problem that hospital systems had was how to keep up with these changing recommendations.
[Government jurisdictions and hospital systems] just weren’t set up as dynamic organizations, whereas an emergency room in a hospital is set up to be very dynamic. Viability requires extensibility. And extensibility has to be built before you’re in the challenge or change situation. Generating this capability during the change is much more difficult than if you generate it in advance.
Extensibility is a dynamic capability: We have to be able to design a system in such a way that it can adapt in advance of a crunch by anticipating that crunch. How does anticipation work? What’s the distinction between anticipation and modeling for brittle reliable systems and contingency planning?
Anticipation turns out to be critical. The classic result is anticipating a bottleneck or crunch, so you act now in order to generate the resources or response capability before the bottleneck hits you. This originally came from studies on how people adapt to high workload, like anesthesiologists who did dynamic stuff in an operating room. They were highly sensitive [to change]. The [absolute] probability of a crunch happening might be low, but they picked up signs that the probability had gotten higher.
People [tend to] discount evidence that challenges their model. Their model is being surprised by events in the world, but instead of being ready to revise, they discount the evidence. It’s [about] how sensitive you are to the emerging information that things don’t fit your model. If you’re waiting for definitive evidence that some new problem has arisen, the problem will be much bigger before you act.
Instead, the people who were good [at adapting] were sensitive. They were picking up early evidence that things might be different, so they monitored new channels. They interacted with other people to pick up what information they had. They changed their effort. Anticipation is very tightly connected to a readiness to revise.
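For readers who think in software terms, the idea of acting before the crunch can be sketched as a check on the trend of load rather than on its current level. This is a deliberately crude illustration: the function names, the straight-line projection, and the notion of a fixed lead time are all our assumptions, and the anticipation Dr. Woods describes draws on human judgment and many channels of evidence, not a single metric.

```python
from typing import Optional, Sequence


def steps_until_saturation(loads: Sequence[float], capacity: float) -> Optional[float]:
    """Project how many steps remain before load reaches capacity.

    Uses a straight-line trend across the samples. Returns None when there
    are fewer than two samples or the load is not rising.
    """
    if len(loads) < 2:
        return None
    slope = (loads[-1] - loads[0]) / (len(loads) - 1)
    if slope <= 0:
        return None
    return max(0.0, (capacity - loads[-1]) / slope)


def should_request_help(
    loads: Sequence[float], capacity: float, lead_time_steps: float
) -> bool:
    """True when the projected crunch is closer than the time help takes to arrive."""
    remaining = steps_until_saturation(loads, capacity)
    return remaining is not None and remaining <= lead_time_steps


if __name__ == "__main__":
    recent = [40, 45, 52, 60]  # hypothetical load samples, capacity is 100
    print(should_request_help(recent, capacity=100, lead_time_steps=8))
```

In the hypothetical samples above, load is only at 60 percent of capacity, so a rule that waits for a high fixed threshold would stay silent, while the trend-based check already signals that help should be requested.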
You’ve previously written that every unit in a system, at whatever scale, has to have non-zero graceful extensibility. What do you mean by that?
If you have zero extensibility you’re maximally brittle. A unit can’t have enough [graceful extensibility] by itself.
To use a hospital example, no unit (a clinician or a clinician team) can have enough ICU capability on its own. No unit by itself can have sufficient graceful extensibility given the possibility for model surprise. And the reason is the same: finite resources and change. This is why an emergency room has the capability to adapt to patient crises, but only so much. At some point in a mass-casualty event it needs help from the rest of the hospital system in order to handle all the patients. It needs more personnel, it needs to expand the space it takes [up], it needs to facilitate interactions with the diagnostic centers in the hospital. So you have to have other interdependent units, and they have to be ready to adapt to help the unit at risk of getting crunched.
As I start to run out of the capacity to act as the situation continues to deteriorate, I need help from somebody who’s in the neighborhood, so to speak. The neighboring units parallel or above, sometimes even below in a network or a hierarchy, need to recognize I’m at risk of saturation and do something to help me.
You can see the breakdown of [extensibility] in the pandemic response. Early in the pandemic, some governors and states got slammed. They were quickly trying to get the public to cooperate with restrictions because they were afraid of overloading their hospitals. They got better cooperation because everyone [realized] we don’t want to have our hospitals look like Italy or New York. Then, once hospitals were able to adapt or it didn’t get that bad for certain jurisdictions, everybody went, “See, we don’t want to do this anymore, it’s not that bad,” and cooperation with activity restrictions dropped off.
That’s an example of a breakdown in what we call reciprocity. Without reciprocity, you can’t get that second layer of, “I have some graceful extensibility, but as I start to run out of capability, I need other parties to help me.”
What would you say to someone who reads this interview and says, “Alright, I understand the concept of resilience. I understand that my system has to have graceful extensibility. Where do I start?”
That’s contingent on the engineering system and the role they’re dealing with. The pursuit of efficiency—faster, better, cheaper—inadvertently undermines the sources of resilient performance. So the general advice is [to adopt] pragmatic but different engineering [practices]. It’s the balance between seeking optimality in the short run and building and investing in graceful extensibility [in the long run]. Those have to be balanced—they interact, they’re interdependent, and you need both.