Inside the complex world of life-saving software

Most of the programs you use every day, from word processors to smartphone apps, need oversight to ensure their usability and security, no matter how much planning and testing went into their production. For the narrow slice of systems and software that lives depend on, dubbed “safety-critical,” the requisite oversight comes alongside strict standards, mandated by governments, industry bodies, and trade organizations, that developers must meet to show the software is safe to use in high-stakes applications.

Today, software deemed safety-critical controls very complex hardware, from medical devices and cars to aircraft and nuclear reactors. Safety-critical software has unique requirements in each field, some of which are more regulated than others, but in general, regulatory agencies require extensive documentation to help ensure that the software is certifiably safe. This can amount to many times more documentation (including fastidious risk management documentation) than a comparably sized piece of consumer software might need, given the extensive planning and testing that safety-critical software must undergo to meet certification standards. Through this documentation, teams must demonstrate that their software is safe to use and has a very low chance of endangering human life.

The broad umbrella of safety-critical software development is notably conservative—about as far from Silicon Valley’s “move fast and break things” mantra as you can get. When lives are at stake, development is glacial, and the niche is generally slow to adopt the tech industry’s latest innovations. Connecting to the cloud, for instance, is off-limits for much safety-critical software, and restrictions can get extreme: Nuclear reactor software, for example, isn’t even allowed to connect to the internet.

But safety-critical is not uniform, and each field has its own rules that dictate how rigorous software makers have to be when proving safety. What connects them is a profoundly cautious and meticulous approach to software development.

Safety-critical: Programming to protect

One of the bellwether incidents that inspired greater rigor in the development of safety-critical software was a series of serious malfunctions by the Therac-25, a radiation therapy machine that killed or caused grave injury to six people in the mid-1980s. Extensive tests reproduced the lethal errors, which resulted from absent safeguards—safeguards that had been present in the device’s predecessors. Investigations revealed that the software for the Therac-25 was written by a lone, inexperienced programmer who wrote it based on the source code for the earlier Therac-20 and Therac-6. But the failure wasn’t just due to spotty software—the system simply wasn’t prepared for human use. The Therac-25 didn’t even undergo unit testing.

What all safety-critical products that can potentially cause great harm have in common today is an extremely low level of acceptable risk. Nuno Silva, technical safety manager at Critical Software, a consulting company that helps clients create and certify software products, often leads companies down the arduous path of making software for safety-critical fields and navigating those risks. While software doesn’t wear out like hardware does, it can and will fail; Silva noted that “the only system where you can have zero failures is the system with zero lines of code.” It’s essential for teams developing anything safety-critical to get the number of potential failures that could lead to harm as close to zero as possible.

With that in mind, not only do teams developing safety-critical software have to document how their software works and provide typical troubleshooting, setup guides, and how-tos, but they must also figure out what regulations apply to their product and follow a step-by-step process to make the software and prove that it complies with safety standards.

Let’s use medical software as an example. After identifying what regulations will apply to their software, which includes determining the software’s classification (which differs from region to region), the development team must effectively determine how much harm the software could cause if it fails. If applicable, the team must also implement a quality management system (QMS) to establish standard operating procedures for maintaining quality.

Then the team begins developing its risk management documentation, which will include a risk management report detailing all test cases and strategies. Safety-critical medical software teams must prove that they have a risk management process in place to address and mitigate risk, and they must create a report that presents all of this information in a tidy summary. The risk management plan must also describe how the team has analyzed and identified what would happen if the software fails. To develop that plan, the team must explore hypothetical situations with test cases that establish risk-mitigation strategies and methods. These test cases must be meticulously explained—it’s not just about how the risk mitigation plan works, but about how the team verified that it works.
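
To make the traceability idea concrete, here is a minimal Python sketch of a check that every identified risk has both a mitigation and at least one test verifying it. The `Risk` record, its field names, and the sample entries are hypothetical illustrations, not a format that any regulator or standard prescribes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Risk:
    """One entry in a hypothetical risk register."""
    risk_id: str
    description: str
    mitigation: str = ""  # how the team reduces the risk
    verifying_tests: List[str] = field(default_factory=list)  # tests showing the mitigation works


def untraced_risks(risks: List[Risk]) -> List[str]:
    """Return the IDs of risks that lack a mitigation or a verifying test."""
    return [
        r.risk_id
        for r in risks
        if not r.mitigation or not r.verifying_tests
    ]


# Hypothetical sample data, for illustration only.
register = [
    Risk("R1", "Dose value overflows", "Range check on input", ["TC-014"]),
    Risk("R2", "Sensor reading is stale", "Timestamp validation"),  # no test yet
]

print(untraced_risks(register))
```

A check like this only flags missing links; the written explanation of how each test verifies its mitigation still has to be produced by the team.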

When all of the trials and testing are finished, the team prepares a document (in the U.S., a premarket approval application) that collects everything into one detailed report. This ur-document, along with the software’s QMS, is audited by qualified regulatory agencies or assessors, such as the Food and Drug Administration (FDA) in the U.S. If everything is in order, the software or device receives certificates and/or clearance letters affirming that it is safe to use.

And that’s just a snapshot of the documentation process companies must follow if they wish to market medical software, or medical devices running on software, regional differences notwithstanding. The painstaking documentation of development and testing serves as a checklist for auditors: Companies use it to lay out the steps they’ve taken to comply with regulations. Other safety-critical fields have their own versions of this extensive testing phase, and their own requirements for the documentation that must be written to account for it.

So it goes that across safety-critical fields, much of the documentation teams create is for certification purposes, never to be seen by the end user. Often, customers just need the equivalent of the owner’s and maintenance manuals—standard user-facing documentation—plus a report saying that the product passed all of its tests.

The price of meeting regulations

Returning to our example, medical software has its own special set of challenges. As in any regulated industry, if it takes too much effort and expense to break into the market, companies will balk. It’s a tough balance for regulators, said Dr. Marion Lepmets, CEO and cofounder of SoftComply, which produces tools for managing software development through Jira and Confluence. Lepmets has seen companies struggle to afford the costs involved in complying with safety-critical regulations and become tempted to shirk the auditing process.

“[Regulators] have to make sure you are not putting anyone at risk. But on the other hand, it’s so expensive for the companies and manufacturers of the devices to go through these audits that [they’re] actually increasing the incentive for [the companies] to try to sneak away from the audits,” Lepmets said. “It’s sort of a delicate balance of compliance and quality that regulators are now looking for.”

Companies are required to abide by regulations according to the particulars of how a device is classified. For example, the current general international standard for medical software is IEC 62304, which has three distinct classifications: Class A means no injury or damage to health is possible; Class B means non-serious injury is possible; and Class C applies to all programs and systems that are potentially life-threatening. (In the U.S., the FDA equivalent designation is Class III.) Companies may also be required to adopt a quality management system (as dictated by ISO 13485, or, in the U.S., FDA 21 CFR Part 820), and submit their product for regular audits.
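
As a rough illustration, the three classes described above can be expressed as a simple lookup in Python. The `Harm` and `SafetyClass` names and the `classify` function are hypothetical; they mirror only the simplified summary given here, not the full text of IEC 62304.

```python
from enum import Enum


class Harm(Enum):
    """Worst-case harm if the software fails (simplified, hypothetical scale)."""
    NONE = "no injury or damage to health possible"
    NON_SERIOUS = "non-serious injury possible"
    LIFE_THREATENING = "potentially life-threatening"


class SafetyClass(Enum):
    A = "A"
    B = "B"
    C = "C"


# Mapping taken from the article's summary of the three classes.
_CLASS_BY_HARM = {
    Harm.NONE: SafetyClass.A,
    Harm.NON_SERIOUS: SafetyClass.B,
    Harm.LIFE_THREATENING: SafetyClass.C,
}


def classify(worst_case_harm: Harm) -> SafetyClass:
    """Return the safety class implied by the worst-case harm."""
    return _CLASS_BY_HARM[worst_case_harm]


print(classify(Harm.NON_SERIOUS).value)
```

In practice, classification rests on a documented risk analysis rather than a single lookup; the sketch only shows how the class follows from the worst-case harm.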

Compared to typical consumer software, Class C software often needs significantly more documentation due to the requisite details in planning, reporting, and conducting clinical trials. Clinical trials in particular account for a lot of time and expense. As a result, companies might try to stay in a sort of regulatory gray zone, for instance by claiming their products aren’t medical devices when they clearly are, just to sidestep regulations.

The future of safety-critical

The forward march of technology requires safety-critical fields to anticipate new failure cases and explain how (and why) their software won’t fail. This has been complicated by new concerns around cybersecurity as internet connectivity and malware introduce new vectors of cyberattack. In response, the safety-critical community has proposed new methods to protect products and software, but digital security approaches have not been broadly codified into standards yet; once they are, there will likely be new cybersecurity-focused testing and documentation requirements to comply with.

But other documentation practices will change as a result of industry shifts that have nothing to do with progressing technology. The medical software field, for instance, is eyeing a shift from compliance with established standards to quality assurance—what the FDA calls the Case for Quality. Under this program, an auditor would look much more closely at the software development process and the code itself, said Lepmets. Although it’s not a part of current regulations, this approach would bring assessors into the development process sooner, where they might identify opportunities to improve the software rather than simply giving it a pass or fail during an audit. The information that development teams provide and document might shift accordingly.

While safety-critical standards dictate the amount of documentation that teams must create, the standards themselves need to be updated at certain intervals. One in particular, IEC 61508, is a broad, generic standard for safety-critical software that applies to many fields. It’s slated to be updated in the next decade, and business developer Thor Myklebust of SINTEF (the Foundation for Scientific and Industrial Research at the Norwegian Institute of Technology) hopes to add something to the standard’s next version: codified information about how to comply with IEC 61508 if teams use agile software development. SINTEF has been using agile methodologies for its own safety-critical software projects since 2011, Myklebust said, but new ideas can face a tepid reception.

Safety-critical regulators have been slow to welcome emerging technologies and methodologies that are prevalent in the wider tech industry, like agile software development. Myklebust and his colleague Tor Stålhane wrote a book, The Agile Safety Case, that lays out how teams using agile development can meet compliance requirements and satisfy safety-critical standards assessors. Their plan limits the number of documents that need to be revised when updating software, which would speed up the process considerably. In the railway industry, where Myklebust primarily operates, it might take six months from writing the last line of code on a project to when it’s implemented on a track or signaling system; he reckons it should only take a week or so. A shorter development process could make safety-critical software teams more flexible and quicker to respond to new challenges. This, in turn, could make safety-critical software fields more open to embracing newer technologies—once they’re deemed safe, that is.

A newer tech industry standard that has challenged safety-critical regulators is integration with cloud-based tools. Some safety-critical domains, like nuclear power plants, aren’t even connected to the internet, in order to ensure their safety; each new version of their software must be installed in person. Cars, on the other hand, are increasingly connected, and automakers like Tesla can update car software over the air. But auditors have expressed security and safety concerns around cloud-based tools that store data.

“As a safety-critical software developer, you have to be in charge of all of the changes made in the software tools you use,” said Lepmets. “Even with precautions like backups, software validation, and regular health checks of the cloud-based tools, regulators are struggling to ensure that the developers are in fact in complete control of all of the changes in their software tools.”

The safety-critical world will always need rigorous documentation and testing to ensure that the industry’s software can be entrusted with people’s lives. While this means that the field is slow to onboard new software tools and methods that most developers take for granted, it makes the larger software landscape something of a test bed from which safety-critical can cherry-pick the best advances to adopt and refine. Better, perhaps, for software to take a little while longer to develop than to fail with so much at stake.
