On Safety-Critical Software Development

Software—for all its boundless possibilities—can kill.

From 1985 to 1987, the cancer treatment machine Therac-25 produced lethal doses of radiation, unbeknownst to those operating it. The difference from previous models? Its safety mechanisms were, for the first time, fully controlled by software.

In 1991, an accumulated rounding error in its timekeeping arithmetic caused the internal clock of the U.S. Army’s Patriot missile defense system to drift by roughly one-third of a second, and the system failed to intercept an incoming Scud missile in Saudi Arabia. Twenty-eight U.S. soldiers were killed and roughly 100 others were injured.

In 2018 and 2019, two Boeing 737 MAX planes crashed due to a combination of sensor, software, and design problems that caused the plane to repeatedly push its nose downward in response to faulty sensor readings, overriding the pilots’ manual input. More than 300 people died.

We’ve all heard about such safety-critical software failures. But consumer technology can have deleterious consequences, too. Take, for example, a 2015 incident in which a drone propeller sliced a toddler’s eye after the person piloting it lost control. Such incidents can also occur without human manipulation: In 2019, an autonomous delivery drone crashed just 150 feet from a group of kindergartners in Switzerland. Thankfully, no one was harmed, but such accidents risk becoming commonplace, and more pernicious, without intervention.

Developers are generally quite good at what they do—indeed, most software faults of this type are the result of erroneous or incomplete requirements, not erroneous or incomplete implementations. Failing to consider resiliency and reliability methods in software development, however, can result in consequences that range from minor financial hits to true tragedy. Many of these methods are practiced by developers of safety-critical software, but they’re used less frequently in consumer products.

We should be developing prophylactically in consumer software too, particularly for products that are connected to the internet and operate within the physical environment. From robotic lawn mowers to consumer drones to baby monitors, thinking in terms of reliability and resiliency matters. Formal methods like software specification, which tests assumptions about requirements, can and should become common practice among developers. In addition to writing code, we can take part in the requirements engineering process as a whole, using mathematical methods and tools to find, discuss, and modify potentially hazardous or faulty requirements early on.

The case for formal methods

Software specification doesn’t need to encompass the entire codebase or its implementation. Rather, it should guide the design of the code before it’s written and throughout its life cycle—from design to deployment to maintenance—to ensure that additional features or significant architectural changes don’t introduce hazardous behavior.

As software consultant Hillel Wayne put it in his 2020 blog post “The Business Case for Formal Methods,” “You write a specification of your system and properties you want it to have. Then you can directly test the design without having to write any code and see if it has problems. If it has a problem, great, you can fix it without having spent weeks building the wrong system. If it doesn’t have a problem, you can start implementing with confidence you’re building the thing right.”

Wayne noted that modeling can flag issues that tests might miss, or even catch bugs before they’re implemented, saving developers time and energy rewriting code. Amazon, Elastic (the creators of Elasticsearch), and Cockroach Labs have all used formal methods to catch bugs in the design phase or uncover complex bugs that slipped past tests, QA, and code review.
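To make the idea of testing a design before implementing it concrete, here is a minimal sketch in Python. It isn’t a real specification language such as those used by the companies above; it is a toy exhaustive state explorer. The robotic lawn mower, its two-variable state, the action names, and the safety property (“the blade never spins while the mower is lifted”) are all hypothetical, chosen only for illustration.

```python
from collections import deque

# Hypothetical design model: a state is the pair (blade_on, lifted).
# Each design function yields (action_name, next_state) for a given state.

def flawed_design(state):
    blade_on, lifted = state
    if not lifted:
        yield "start_blade", (True, lifted)
    yield "stop_blade", (False, lifted)
    yield "lift", (blade_on, True)    # flaw: lifting leaves the blade running
    yield "put_down", (blade_on, False)

def fixed_design(state):
    blade_on, lifted = state
    if not lifted:
        yield "start_blade", (True, lifted)
    yield "stop_blade", (False, lifted)
    yield "lift", (False, True)       # requirement added: lifting stops the blade
    yield "put_down", (blade_on, False)

def is_safe(state):
    """Safety property: the blade must never spin while the mower is lifted."""
    blade_on, lifted = state
    return not (blade_on and lifted)

def find_violation(design, initial, invariant):
    """Breadth-first search over every reachable state.

    Returns the shortest list of actions leading to a state that breaks
    the invariant, or None if no reachable state breaks it.
    """
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        for action, nxt in design(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [action]))
    return None

if __name__ == "__main__":
    start = (False, False)  # blade off, mower on the ground
    print(find_violation(flawed_design, start, is_safe))
    print(find_violation(fixed_design, start, is_safe))
```

The explorer reports a sequence of actions that reaches the hazardous state in the flawed design and finds none in the fixed one, all before any mower firmware exists. Dedicated tools apply the same idea to far larger state spaces.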

Adding to Wayne’s business case for formal methods, the safety failures noted earlier illustrate the need to teach and incorporate thinking in specifications into consumer software development. As we increasingly trust and bring these devices into our homes, we ought to have some guarantee that the software controlling them won’t cause harm.

Modeling for reliability

When software is built with incomplete or erroneous requirements, it’s a sign that a mental model has failed in architecting the software system. This is where reliability analyses and other dependability and systems engineering methods come in. Thinking about the specification of the system and constructing a model of possible erroneous or hazardous states is a stepping stone to implementing requirements that will prevent unsafe software from being deployed in the wild.

To be sure, this can be challenging: Real-world systems, with all their complexity, might require modeling a great many potential states. But by testing our assumptions with these practical and lightweight formal methods, developers are able to write code that’s safer and more reliable from the outset.

Modeling techniques for assessing the reliability of a system specification are relatively straightforward, even if the implementation itself is complex. The system is modeled as a set of states connected by arrows (transitions), and each arrow carries a probability: either the probability of a failure or the probability of a recovery.
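As a rough illustration of such a model, the sketch below treats the system as a small discrete-time Markov chain and computes how likely it is to be in each state after a number of steps. The three state names and every probability are made-up assumptions for the example, not measurements from any real system, and dedicated reliability tools use richer models than this.

```python
STATES = ["operational", "fail_safe", "fail_unsafe"]

# Hypothetical per-hour transition probabilities (each row sums to 1).
# "fail_unsafe" is absorbing: once entered, the model never leaves it.
TRANSITIONS = {
    "operational": {"operational": 0.989, "fail_safe": 0.010, "fail_unsafe": 0.001},
    "fail_safe":   {"operational": 0.500, "fail_safe": 0.500, "fail_unsafe": 0.000},
    "fail_unsafe": {"operational": 0.000, "fail_safe": 0.000, "fail_unsafe": 1.000},
}

def step(dist):
    """Advance a probability distribution over states by one time step."""
    nxt = {s: 0.0 for s in STATES}
    for src, p in dist.items():
        for dst, q in TRANSITIONS[src].items():
            nxt[dst] += p * q
    return nxt

def distribution_after(steps, start="operational"):
    """Probability of being in each state after `steps` steps from `start`."""
    dist = {s: 0.0 for s in STATES}
    dist[start] = 1.0
    for _ in range(steps):
        dist = step(dist)
    return dist

if __name__ == "__main__":
    for hours in (1, 24, 1000):
        d = distribution_after(hours)
        print(hours, {s: round(p, 4) for s, p in d.items()})
```

Changing a single arrow, for instance raising the recovery probability out of the fail-safe state, and rerunning the calculation shows how a design decision shifts the long-run chance of reaching the unsafe state.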

In this state-space model, the states don’t model the architecture of the actual system but rather whether the system is in an operational, fail-operational, fail-safe, or fail-unsafe mode. “Operational” is how the system is expected to behave under normal conditions. “Fail-operational” defines the possible states in which the system has failed but still operates as expected. “Fail-safe” states are those in which the system has failed and is transitioned to a state that won’t cause a hazard, even if it’s operating in a degraded manner. “Fail-unsafe” is when the system has failed and, because there’s no built-in mitigation for that particular scenario, it becomes a hazard that can cause an accident. The complete degradation of the system—an unsafe system state—is known as the death state.
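The four modes can be written down as a small classification, which is sometimes a handy way to review failure scenarios one at a time. The sketch below is only an illustration: the `classify` function, its three yes/no questions, and the drone scenarios in the usage lines are hypothetical, not part of any standard.

```python
from enum import Enum

class Mode(Enum):
    OPERATIONAL = "operational"
    FAIL_OPERATIONAL = "fail-operational"
    FAIL_SAFE = "fail-safe"
    FAIL_UNSAFE = "fail-unsafe"

def classify(fault_present, behaves_as_expected, hazard_mitigated):
    """Map three yes/no answers about a scenario to one of the four modes."""
    if not fault_present:
        return Mode.OPERATIONAL
    if behaves_as_expected:
        # Something failed, but the system still does its job.
        return Mode.FAIL_OPERATIONAL
    if hazard_mitigated:
        # Degraded, but moved to a state that will not cause a hazard.
        return Mode.FAIL_SAFE
    # Failed with no built-in mitigation for this scenario.
    return Mode.FAIL_UNSAFE

if __name__ == "__main__":
    # Hypothetical consumer-drone scenarios.
    print(classify(False, True, True))   # normal flight
    print(classify(True, True, True))    # one of several redundant sensors dies
    print(classify(True, False, True))   # link lost, drone lands itself
    print(classify(True, False, False))  # controller hangs mid-flight
```

Walking a list of requirements through a table like this makes gaps visible: any scenario that lands in the last branch is one the design has no answer for yet.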

The overarching concept is this: A system can operate under failure. Modeling for reliability, therefore, considers how to design a system that doesn’t cause catastrophic consequences, even if it’s operating in a degraded manner.

By making these formalisms part of software specification, along with system assurance, and by keeping these well-studied and simple reliability notions top of mind, we’ll be better able to design for states of failure and recovery before failures occur. In addition to what the system ought to do, we’ll develop for what the system ought not to do.

This is just the first step. One of the lessons a systems perspective can teach us is that ensuring the reliability of components is not enough to ensure the reliability of the whole system. Software engineers should be familiar with, and prepared to deal with, other coupled properties, such as safety, security, and dependability. They should also be prepared to explain during code review how their code upholds these properties.

Designing for resilience

There’s a wide spectrum of implications and scenarios that are difficult to account for in logical or quantitative models of software behavior. While reliability is a fundamental and useful way to think about losses through notions of failure, resilience is about what the system does once a failure has happened. By one definition, resilience is a system’s ability to quickly reestablish its operation in the face of erroneous or malicious inputs.

Resilience can add a useful set of design patterns to a software engineer’s toolbox, particularly when designing software for embedded systems. In the presence of a loss scenario, a developer should take resilience patterns such as diversification and redundancy into account when architecting software for machines that operate in the physical world.

Diversification and redundancy are often implemented hand in hand. Multiples of a sensor or processor (often from different manufacturers, hence “diversification”), for example, are included in the system so the failure of one component doesn’t mean the failure of the entire system.

Although diversification and redundancy have been applied to software through methods like n-version programming (a process in which developers independently produce multiple functionally equivalent versions of a program from the same set of specifications), these methods haven’t been proven to work at the implementation level. They are, however, extremely useful for software engineers who have to think about systems as a whole: hardware and software combined. In this context, diversification can mean writing software that supports multiple versions of firmware or hardware, so if one fails because of an intrinsic fault, the system’s operation remains unaffected.

These design patterns teach us that two inertial measurement units (redundancy) from different manufacturers with different firmware (diversification) are better than one; three, or any larger odd number, are better still. (With two sensors, it may be hard to tell which one is failing; with three, you can trust the output of the majority.) Among other safety improvements, odd-number sensor diversification has helped make space and commercial flight much safer over the decades.
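The majority idea can be sketched as a simple voter over an odd number of redundant readings. This is a minimal illustration under stated assumptions: the `vote` function, the agreement `tolerance`, and the sample readings are hypothetical, and a real voter would also have to handle timing, stale data, and sensors that fail in the same way at once.

```python
from statistics import median

def vote(readings, tolerance):
    """Majority vote over an odd number (three or more) of sensor readings.

    Readings within `tolerance` of the median count as agreeing.
    Returns (average of the agreeing readings, indexes of suspect sensors).
    Raises RuntimeError when no strict majority agrees.
    """
    if len(readings) < 3 or len(readings) % 2 == 0:
        raise ValueError("need an odd number of at least three readings")
    mid = median(readings)
    agreeing = [r for r in readings if abs(r - mid) <= tolerance]
    if len(agreeing) <= len(readings) // 2:
        raise RuntimeError("no majority agreement among sensors")
    suspects = [i for i, r in enumerate(readings) if abs(r - mid) > tolerance]
    return sum(agreeing) / len(agreeing), suspects

if __name__ == "__main__":
    # Two sensors roughly agree; the third is far off and gets flagged.
    value, suspects = vote([10.1, 9.9, 42.0], tolerance=0.5)
    print(value, suspects)
```

With only two readings the same disagreement would be detectable but not resolvable, which is why the function refuses even counts.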

Reliability and resiliency tactics must permeate all of system development and shouldn’t fall only to the safety or systems engineer to recommend. In a parallel universe, an odd number of diverse redundant sensors would have fed the Boeing 737 MAX’s flight control software, and the crashes might not have occurred.

Development for the real world

It’s very unlikely we’ll figure out how to make systems 100 percent safe 100 percent of the time, but reliability and resiliency practices are among the most effective tools we have, and they can be used today. They’re also relatively simple and straightforward: Open-source tools like NASA’s WinSURE reliability analysis program make using reliability models throughout development a practical reality.

As more human-computer interactions take place in the physical world, it’s increasingly important for software engineers to take responsibility for the safe operation of the systems they develop and learn from past mistakes. By taking a more proactive stance, we can unlock software’s vast possibilities with greater confidence that our systems will not cause harm.

From issue 11

The epistemology of software quality
