Resilience is a process: something you must actively perform, not something you check off a list once. It’s not the same as robustness, though the two are often confused. Computer systems can be robust against known failure modes. If we find bugs, we can fix them and write tests to catch those specific issues in the future. If we find ourselves forgetting a certain step in a process, we can automate or document it for next time. Robustness can help with failure modes that are already known or considered, but it’s only when we add the creativity and flexibility of humans that we achieve the resilience to respond to unknown unknowns.
In order to be resilient—to be able to prepare for and respond to those unknown failure modes—an organization needs three things. First, it needs the organizational learning skills to maintain and develop a body of shared knowledge. These skills—including systems thinking, team and individual learning, shared goals, and adaptability—prevent the wasted effort of addressing the same problems over and over. Second, an organization must have the tools, processes, and ambient psychological safety to communicate effectively within and across teams. Finally, it requires enough slack for engineers to be able to perform both proactive and reactive work as needed. Without sufficient slack, it’s easy to end up in a never-ending cycle of fighting fires, finding Band-Aid solutions, and dealing with growing technical debt. If people don’t feel safe owning up to mistakes or admitting they don’t know something, it’s harder to diagnose and fix issues. And without the ability to learn from past incidents, work becomes Sisyphean, with meaningful improvement feeling forever out of reach.
Resilience, then, must be encoded in culture. This requires an understanding of what culture is and how to change it. Many organizations use the word “culture” when talking about their shared principles, but this is largely theory; culture is what happens in practice. For our purposes, culture is the collection of behavioral norms, social scripts, incentive structures, and processes that implement a set of values for a group of people.
Consider the example of two organizations that share “uptime” as a cultural value. One organization might try to achieve its desired 9s by imposing strict rules, having binders full of checklists, and punishing or firing engineers who make mistakes or deviate from the approved processes. Another organization might document recommended processes but let engineers use their best judgement, relying on blameless postmortems to understand why people acted the way they did. Even if both of these organizations end up achieving the same amount of uptime, their cultures are very different.
Resilient culture doesn’t happen by accident. Without any goals or design processes, an organization’s culture will end up defined by those who have the loudest opinions or the most social capital. Organizations name, define, and document their values to try to prevent this, but—as illustrated above—defining values is only the first step. In order to ensure a specific cultural outcome, clear goals and a culture design process are needed.
Figuring out where to start can be challenging, and if your team or org has reached the point where it needs a significant cultural shift, it can feel like what needs to be overhauled is, well, everything. Start by looking for components of culture that can be directly manipulated, like designable surfaces. A designable surface is anything that can be changed in an attempt to reach a desired cultural outcome. Examples include career matrices and promotion processes, templates for procedures like incident response, custom ChatOps interactions, GitHub issue templates, and team structures. Multiple designable surfaces working together can have a significant impact on culture overall.
The culture design process should begin with establishing your organization’s definition of resilience and with setting and understanding your team and company goals. For example, an SRE team’s charter might consist of three focus areas: infrastructure management, process ownership, and software development best practices.
The next step is to look at the areas where your goals or charter intersect with the resilience factors defined earlier. A structured way to approach this is to create a table with team goals on one axis and resilience factors on the other, filling out each of the intersections. You’ll want to answer two key questions for each intersection. The first is the goal state: What would a resilient organizational culture look like in this area? Your goal state doesn’t have to be picture-perfect—perfect, after all, can be the enemy of good—but rather one that everyone can agree is a significant improvement. The second is your current state: How do team members actually behave? Be as accurate as you can, especially if you find that actions differ from expectations.
Differences between how things are now and how you want them to be in a more resilient future should start to become apparent during this exercise. For example, resilience at the intersection of infrastructure management and organizational learning might require a set of clearly defined incident response and postmortem processes. If your team’s processes in these areas are inconsistent and undocumented, that’s an issue to address. The resilient intersection of software development best practices and communication could include code reviews and internal learning programs to share expertise between teams. Process ownership combined with slack could mean giving engineers time to work on side projects that reduce toil. There’s no one-size-fits-all solution. Don’t be afraid to brainstorm until you figure out what resilience looks like for you.
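If it helps to keep the table somewhere more durable than a whiteboard, the grid described above can be sketched as a small data structure. The following is only one possible shape, written in Python: the goal and factor names come from the example charter and the three resilience needs discussed earlier, while the class name `Intersection`, the helper functions, and the sample entry are illustrative assumptions rather than a prescribed format.

```python
from dataclasses import dataclass
from itertools import product

# Axes of the table: the example SRE charter and the three resilience factors.
TEAM_GOALS = [
    "infrastructure management",
    "process ownership",
    "software development best practices",
]
RESILIENCE_FACTORS = ["organizational learning", "communication", "slack"]


@dataclass
class Intersection:
    """One cell of the table (hypothetical structure)."""
    goal_state: str = ""     # What would a resilient culture look like here?
    current_state: str = ""  # How do team members actually behave?

    def is_filled(self) -> bool:
        return bool(self.goal_state and self.current_state)


def empty_grid() -> dict:
    """Create one blank cell for every (goal, factor) pair."""
    return {pair: Intersection() for pair in product(TEAM_GOALS, RESILIENCE_FACTORS)}


def unfilled(grid: dict) -> list:
    """List the intersections that still need a goal state or a current state."""
    return [pair for pair, cell in grid.items() if not cell.is_filled()]


grid = empty_grid()

# Sample entry, paraphrasing the incident response example from the text.
grid[("infrastructure management", "organizational learning")] = Intersection(
    goal_state="clearly defined incident response and postmortem processes",
    current_state="processes are inconsistent and undocumented",
)

for goal, factor in unfilled(grid):
    print(f"Still to discuss: {goal} x {factor}")
```

Listing the unfilled cells is a design choice, not a requirement: it simply makes it harder to skip an intersection that nobody has thought about yet.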
Once you’ve identified where you are and where you want to go, planning and prioritization can begin. If you’ve identified several necessary changes, using an impact effort matrix can help you figure out where to focus first. Each change should have a person or group who champions it, communicates its impetus and impact, and establishes a clear definition of success. This last point is why you need to first identify what resilience looks like: It’s easier to make lasting changes when you’re moving toward a positive goal rather than moving away from a negative state. And defining what constitutes “done” also makes it easier to share successes. That momentum, especially early in the culture change process, is critical to keeping plans on track.
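An impact effort matrix can live on sticky notes, but a rough sketch in code shows the mechanics. Everything specific below is an assumption made for illustration: the 1-to-5 scoring scale, the threshold of 3, the quadrant labels (common in descriptions of this technique, though wording varies), the made-up scores, and the placeholder champion names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A proposed cultural change (hypothetical structure)."""
    name: str
    impact: int    # 1 (low) to 5 (high), the team's own estimate
    effort: int    # 1 (low) to 5 (high), the team's own estimate
    champion: str  # person or group who owns and communicates the change


def quadrant(change: Change, threshold: int = 3) -> str:
    """Place a change in one of the four quadrants of the matrix."""
    high_impact = change.impact >= threshold
    high_effort = change.effort >= threshold
    if high_impact and not high_effort:
        return "quick win"
    if high_impact and high_effort:
        return "major project"
    if not high_effort:
        return "fill-in"
    return "reconsider"


def prioritize(changes: list) -> list:
    """Order changes by highest impact first, breaking ties by lowest effort."""
    return sorted(changes, key=lambda c: (-c.impact, c.effort))


# Placeholder scores and champions; substitute your team's own estimates.
changes = [
    Change("start an internal learning program", impact=4, effort=4,
           champion="engineering managers"),
    Change("update GitHub issue templates", impact=2, effort=1,
           champion="developer tooling group"),
    Change("document the postmortem process", impact=5, effort=2,
           champion="incident working group"),
]

for change in prioritize(changes):
    print(f"{quadrant(change):>13}: {change.name} (champion: {change.champion})")
```

The scores matter less than the conversation that produces them. If two people disagree sharply about a change's impact or effort, that disagreement is worth exploring before work begins.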
Keep in mind that reaching a “done” state for an individual cultural change doesn’t mean you’re done working on resilience. This design process is a tool kit you can use to identify and respond to problems and challenges with timeliness and flexibility. Replacing one static culture with another might improve robustness, but it doesn’t lead to long-term resilience; enacting change to create a more dynamic culture, as part of an ongoing shift, allows your organization to respond to those unknown unknowns, both social and technical.
As your teams grow and change, so will your focuses and challenges. Going through the steps of defining and prioritizing resilience—and making these methods part of your regular organizational planning and life cycle processes—will enable you to build a growth-oriented culture that can keep learning, improving, and building resilience for years to come.