Systems analysis through postmortems

Postmortems can help measure the unknown in working software systems. They can also inform changes that make those systems more resilient—and better understood.
Part of Issue 12, February 2020: Software Architecture

If software is a “system of systems,” a nested hierarchy of components layered on top of each other, then software architecture is a model of that hierarchy and the interactions between its components. Crucial to designing successful architecture is knowledge both of the software itself and the world it serves—but this knowledge is at best incomplete and out of date.

“All models are wrong, but some are useful.” In 1976, statistician George E.P. Box coined this succinct reflection on the difficulty of designing systems that mirror the complexity of the real world. While Box was talking about statistical models, his maxim also applies to software models. All software is a best-effort model of the world, constrained by the limits of the hardware, the companion software with which it’s interacting, and the skill of its implementers. Even given a perfect understanding of both the world and our software model, no architect works in a vacuum. Instead, developers operate within a network of colleagues, teams, and stakeholders, each of whom may modify the software or how it’s used. Our architectures evolve, seemingly independent of us, and are more likely to break the more we iterate on them without checking the model’s accuracy.

Things inevitably go wrong. More often than not, shortfalls become glaringly obvious in that most uncomfortable way—systems failure. Whether it’s the planned downtime of a distributed lock service, an instance of TCP ephemeral port range exhaustion, or a regex parsing library behaving differently than expected, our own carefully designed software systems can violate our expectations with dramatic consequences for our users and our businesses. Though often painful, these moments of failure are opportunities to discover the limits of our architecture. In these situations, programmers often use postmortems: mechanisms to unravel the threads that bind our systems together, identifying areas that can (and sometimes must) be reworked into a more resilient design.

Postmortems aren’t predictive, but they need not be used only in the wake of disaster. Just as service-level objectives (SLOs) can detect many different kinds of failure, postmortems can be used to understand them. In fact, postmortems can be designed to highlight both large and small differences between the model of the software and the actual behavior of that software in the world. That knowledge can then be used to modify the application to better fit it to the real-life constraints of both users and production—a process that results in feedback-driven architecture.

On postmortems

It’s our singular task as engineers to design software that’s useful to our users. Yet I don’t think I’ve been able to design a single system that functions exactly as I anticipated when first exposed to its end consumer. I’m not alone in this: All of our software is shaped by personal biases and is inherently imperfect.

Whether we choose to implement a microservice architecture, deploy Kubernetes because similar organizations are doing so, use the RAFT distributed-consensus algorithm to design high-availability services even though our colleagues don’t have experience with it, or design our own observability toolkit because we (mistakenly) consider it a simple problem, we are almost guaranteed to make some poor decisions with every revision of our architecture. We seem to have a crystal clear view of the trade-offs we make when it comes to where we store, ship, and move our data, but when our lovingly crafted observability systems (for example) break in production, we are caught off guard and unable to restore service. At least hindsight is 20/20.

Despite the thousands of hours of text to read and talks to listen to on the subject, we can’t completely expunge complexity from our software architecture. Modern software enterprises are often running hundreds, if not thousands, of services in production, all of which interact with each other. Their software architecture is so large and complex that a single person can’t possibly know it all.

Postmortems provide an opportunity to understand, and improve, our software architecture because they measure our software model against reality. The goal is to collect as much information as possible about the system as it was, rather than as we imagined it to be. One example from my own on-call history is the complete failure of a customer-facing service. The service was a fairly large, complex monolith written in PHP and deployed on top of Kubernetes. The abbreviated time line went something like this:

12:45–1:35 p.m.: The on-call developer gets a page for the LATAM service and sees that the root partition is filling up with the service’s log file. After verifying much of the log contained the same Redis connection-failure message, they drop the log, restoring service.

1:45 p.m.: The developer notices that the “disk full” alert did not fire.

10:30 p.m.: The LATAM service is paged again. Developer drops the log again to restore the service and buy time for investigation.

10:40 p.m.: Developer notices that the root partition is rapidly filling up again with the same log message.

11:50 p.m.: Developer learns that too many sockets were open in a different container after finding an article on TCP TIME_WAIT online.

11:55 p.m.: Developer discovers that, unlike the host, the container does not reuse TIME_WAIT connections over the loopback interface. They enable that reuse via sysctl and the system recovers.

12:20 a.m.: Developer goes to bed.

This failure was an interesting one. It uncovered a whole set of failed assumptions about the system’s expected behavior and use that made the architecture fragile, as well as some aspects of the system that went fortuitously well. We found that the following factors contributed to the failure:

  • Nodes that had become unhealthy were not destroyed and recreated automatically.

  • Redis was in the same failure domain as the web application.

  • Redis would fail when it couldn’t write data to disk, even though that data was transient.

  • Logs were being written to the root partition in the same domain as critical system files.

  • The log file could entirely fill up the disk, rendering it useless.

  • The behavior of the network was different in a container than it was in the root network namespace.

  • Self-hosted monitoring in a single cluster was not reliable.

We also uncovered the following mitigating factors:

  • Pingdom was effective at verifying the service independently of our own monitoring.

  • Because Linux has become the de facto server platform for web software, it is possible to diagnose a problem with the system by running a set of commands in the terminal and pasting the output or error message into Google.

Contributing and mitigating factors are useful to determine how to reduce the risk of future disasters, but they also represent the end of the thread of issues that hints at potential architectural shortfalls. Following that thread can yield tremendous insight into the underlying pressures that generate these issues and how they might best be addressed.

Let’s begin with the contributing factor that most deviates from our conceptual model of how the system should function: the log spam about failure to push metrics to Redis. While the other failures were also significant, this was a case of diagnostics—normally used to keep the system healthy—instead breaking the system. The vast majority of log entries were the same, differing only in their time stamps:

connect() failed: Cannot assign requested address in ${APPLICATION_PATH}/${PROMETHEUS_LIB}/Redis.php

The fact that they killed the service is significant in a few ways. First, logs should report errors, not create them. Second, the file was full of easily compressible content. Third, the content was largely repeated and, in aggregate, useless. And fourth, the content didn’t represent an actual application error, but rather an error in the reporting metrics.

Each of these items suggests its own fix: compressing logs, removing them, or sampling them. However, by taking a step back, we can see another, broader solution: Logs don’t need to be written to disk in the first place. Rather, we can put a new architectural rule in place: Logs should be written to a stream (STDOUT or syslog).

The logic that addresses these problems has long since been solved in systemd-journald, syslog, and other such stream-based tooling. But even if it hadn’t, it would be simpler to write our own stream handler for logs than to continue to maintain them on disk.
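
Here is what such a stream handler might look like. The sketch below uses only PHP builtins, since the article doesn’t name a logging library, and the field names are invented; it writes one structured line per event to STDOUT and leaves collection, rotation, and shipping to the platform:

    <?php
    // Minimal sketch: write structured log lines to a stream instead of a
    // file on the root partition. Collection and rotation become the
    // platform's job (Docker, Kubernetes, journald). Field names are
    // illustrative only.
    function logLine(string $level, string $message, array $context = []): void
    {
        static $stream = null;
        $stream = $stream ?? fopen('php://stdout', 'w');

        fwrite($stream, json_encode([
            'ts'      => date(DATE_ATOM),
            'level'   => $level,
            'message' => $message,
            'context' => $context,
        ]) . PHP_EOL);
    }

    logLine('error', 'connect() failed: Cannot assign requested address', [
        'component' => 'metrics/redis',
    ]);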

We can follow this thread even further. Normally, users are encouraged to log to STDERR/STDOUT within a container. In this case, the norm was violated to maintain a bespoke log-aggregation implementation. The postmortem proved that this effort didn’t yield a good return on investment. The thread ends with a final architectural principle: Do less.

Mitigating factors are likewise threads to be followed—it’s just as important to unpack successes as failures. In this case, a major mitigating factor was the wealth of publicly available information about Linux. From this simple fact we can draw up another architectural principle: Well-known systems behave in well-known ways.

Winding up the spindle of our postmortem, we’ve gone from a concrete set of outcomes to a general set of architectural guidelines to some fundamental principles of software design. These insights can be codified and reproduced for consumption throughout the organization, with the postmortem itself serving as a concrete example.

On . . . premortems?

Though postmortems can determine how a software system behaves in a production environment and how to modify architecture, they are usually only invoked as a result of some disaster: when the system behaves so spectacularly incorrectly that its shortfalls are evident and explanations are demanded. However, it’s possible to shift the parameters by which a postmortem is triggered. This involves identifying some aspects of the service that clearly indicate its utility to consumers. In the example above, the service’s health might be measured by how many users were able to access its homepage at any given time, compared with how many were presented an error message instead.

This service-level indicator (or SLI) becomes a proxy for the health of that service. With robust SLIs in place, set a service-level objective (or SLO) that describes how healthy the service should be. The error spamming the log file, for instance, meant that users weren’t able to access the homepage. If an SLO expected a service-request success rate of 99.9 percent, then the error in the example would have been caught and addressed immediately, just after users were presented with error messages instead of the homepage and long before the problem escalated into a full-scale outage.
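
To make the arithmetic concrete, here is a small sketch with invented numbers; only the 99.9 percent target comes from the example above. The SLI is the fraction of successful requests over a window, and the error budget is whatever failure rate the SLO leaves over. In practice these counts would come from a monitoring system rather than from application code:

    <?php
    // Sketch with hypothetical request counts; only the 99.9 percent SLO
    // is taken from the example in the text.
    $slo    = 0.999;   // target: 99.9 percent of homepage requests succeed
    $total  = 120000;  // requests served in the measurement window
    $failed = 150;     // requests answered with an error page

    $sli       = ($total - $failed) / $total;  // 0.99875
    $budget    = (1 - $slo) * $total;          // 120 failures allowed
    $remaining = $budget - $failed;            // -30: the budget is spent

    printf("SLI %.5f vs SLO %.3f; error budget remaining: %d requests\n",
        $sli, $slo, $remaining);

    if ($sli < $slo) {
        // This is the point at which a systems analysis is warranted,
        // well before the disk fills and the outage becomes total.
        echo "SLO breached: investigate now.\n";
    }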

As business needs evolve, so must software architecture. However, it’s difficult to anticipate where an organization will go, much less which architectural trade-offs will need to be made. The best way to revise software, then, is to continually evaluate whether it works under the new demands being placed on it. SLOs both determine and advertise how successful our services should be, and provide a clear signal when services are no longer meeting our expectations. And when software doesn’t work in the expected way, turn to a postmortem.

A postmortem, but pre-disaster? Yes, postmortems are laborious and time-consuming: reconstructing a time line, revisiting the assumptions under which the software was designed, potentially challenging organizational norms, and accepting that any given failure is a collective—rather than individual—responsibility. But postmortems are a powerful tool to ensure continued and deliberate revision of architecture in the service of the user, and can help us understand the interplay of software and business.

Key to this kind of postmortem—let’s call it a systems analysis—is measuring whether software is successful, rather than just if it’s broken. For example, my organization didn’t meet its primary objective and key result (OKR) in the second quarter of 2018: to successfully launch the progressive web app (PWA) version of the organization’s e-commerce service. We were supposed to launch the application in the four global regions in which the service operates, have 15,000 users employing the application as their primary shopping service, and have 3,000 users add the application to their home screens. But the PWA hadn’t even launched. We put together a time line for this systems analysis:

January 17: A senior organizational team member and a senior technical team member discuss the new PWA technology informally over coffee.

January 30: The senior technical team member creates an architecture for the PWA, estimating its cost and complexity.

April 1: The quarter begins. The senior organizational team member sponsors the implementation of the PWA as one OKR.

April 7: New API and AngularJS teams for the PWA project are formed. The project architect is no longer involved.

April 10: The API and AngularJS teams agree on rough specifications in the kickoff and split off to start work.

April 11: The API team decides to implement the REST API on top of the existing monolithic application.

April 21: The AngularJS team completes a proof of concept.

May 7: The API team decides to implement a version of REST with only POST and GET verbs because the framework cannot tolerate other verbs.

June 7: The API team delivers the API to the AngularJS team.

June 8: The AngularJS team indicates that the API violates normal REST API design. The API is not well-documented and is unreliable.

June 12: The API team informs the project manager that the project is complete. The API team is reassigned to another project.

June 13: The AngularJS team is told to continue with the current API design, since revisiting it will be too expensive.

July 1: The AngularJS team indicates the website will not be completed in time due to difficulties implementing the nonstandard API. They are advised by senior team members to deliver something quickly at the cost of quality.

July 12: The AngularJS team finishes the website, which is unreliable and slow. The project sponsor refuses to launch.

July 31: The quarter ends.

When a time line is first constructed, it’s not usually evident what the actual problems were. Each team made a series of compromises (as is routine for all software engineering), but which turned out to have unanticipated effects? Only when we collectively process that time line can we establish the causality (i.e., the thread), which is often only apparent in retrospect.

Many interesting threads can be unraveled from this example. First, the product and API teams were given limited specs to implement. Second, the API team was implementing the project’s API on a framework that had accumulated years of shortcuts, which made it difficult to implement the full HTTP verb set. Third, the AngularJS team struggled to implement logic on top of an API that violated normal RESTful design patterns. Fourth, the API team was reassigned to another project before the product shipped. Fifth, the senior organizational team member—the person who had originally sponsored the project—received feedback late in the cycle.

It’s not incorrect to say that the API team, unaware of the downstream effects for the AngularJS team, compromised on the HTTP API. This compromise was not due to negligence, incompetence, or some other moral failure. Instead, they lacked visibility into other parts of the project and responded to the incentives they were given.

Because the API was nonstandard, existing tooling couldn’t be reused for it, and it didn’t implement responses correctly, returning unexpected payloads and nonstandard status codes. This meant that the AngularJS team needed to essentially reverse engineer the backend to get their part done, leaving little time to do the necessary performance analysis. Meanwhile, the company bounced teams between projects depending on what was in the pipeline and prioritized finishing projects early and cheaply, since it absorbed the shortfall. These factors cascaded into this project’s failure to meet its deadline.
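
The article doesn’t show the API itself, but a “POST and GET only” compromise typically means tunneling the other HTTP verbs through POST. A purely hypothetical sketch, with routes and payloads invented for illustration, shows the mismatch the AngularJS team had to absorb:

    <?php
    // Hypothetical routes; the real API isn't shown in the article.
    // Standard REST: the verb carries the semantics, so generic HTTP
    // clients, caches, and tooling work unmodified.
    $standard = [
        ['PUT',    '/cart/items/42'],   // update an item's quantity
        ['DELETE', '/cart/items/42'],   // remove an item
    ];

    // Verb-tunneled variant (GET/POST only): the action moves into the URL
    // and payload, so every endpoint needs bespoke, hand-documented client code.
    $tunneled = [
        ['POST', '/cart/updateItem'],   // body: {"id": 42, "quantity": 3}
        ['POST', '/cart/removeItem'],   // body: {"id": 42}
    ];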

But what about the architecture? Well, there was no architect working on the project. Rather, there was a plan for the architecture that was handed from team to team with no feedback or check-in mechanism to make sure it was properly followed. While process and software problems also contributed to this failure, the lack of regular feedback verifying the software against user expectations was a central factor.

Software is a system, and postmortems are a mechanism to understand how a given system reached a negative outcome. Why place arbitrary limits on what those negative outcomes may be? Establishing more sensitive triggers than catastrophic failures for conducting a postmortem—or systems analysis—forces organizations, at the very least, to pause and reach for understanding. Some organizations might also enforce stricter norms of correct behavior, such as how long a request should take, how long a job should remain in a queue, or how long a user should wait to receive a confirmation email after placing an order. This, in turn, may reduce customer impact when something goes wrong—because something will. The best-informed architects still design imperfect software.

A tool for feedback-driven architecture

As engineers, we’re tasked with representing the world in software. We do this to the best of our abilities, but we invariably get some of the details wrong. Systems analyses are a powerful tool to understand where our architectural model differs from reality, surfacing risks and shortfalls. And by approaching larger, slower-moving organizational outcomes with the same level of care, respect, and scrutiny as we do our systems failures, we can identify opportunities to revise the software to better match reality.

More often than not, a systems analysis doesn’t reveal that someone made an incorrect judgment given the information they had about a piece of software, but rather that someone had information that, at some point, no longer proved correct—or correct enough. Too often, architectures suffer from the myth of complete knowledge at the outset. Postmortems and systems analyses, instead, teach us that architecture must tolerate ambiguity and incomplete knowledge. Feedback-driven architecture—made possible by thoughtful review conducted in the wake of both systemic failures and missed deadlines—offers a different, more resilient model. When we accept the limitations of our knowledge, and put processes in place to deliberately identify and resolve problems when they arise, we can create architectures that evolve and endure with us.

About the author

Andrew Howden is an accidental software engineer turned accidental SRE. He aspires to simple, beautiful software development.

@andrewhowdencom

