In space, no one can hear you kernel panic

For NASA, redundancy is all-important. (Why send a single server beyond the stratosphere when you can send five?)
Part of Issue 12, February 2020: Software Architecture

When you’re millions of miles from home, it’s hard to install an operating system update—but not impossible. From the dawn of the Space Age through the present, NASA has relied on resilient software running on redundant hardware to make up for physical defects, wear and tear, sudden failures, or even the effects of cosmic rays on equipment.

The software architecture of space missions must be robust without being rigid to “deal with the kinds of uncertainties that arise in the context of space,” says David Garlan, a founder of the field of software architecture and a distinguished visiting scientist at the NASA Jet Propulsion Lab. Garlan, who is also an associate dean and professor at Carnegie Mellon’s School of Computer Science, says spacecraft systems, in particular, need a fault-protection layer that allows them to switch to emergency protocols without immediate intervention from Earth. But he also believes spacecraft should be designed with more autonomy in normal operations so they can achieve broader scientific goals without a constant guiding hand on the controls by scientists.

This makes for an architecture that might dismay terrestrial developers, since the software must avoid becoming bogged down in tasks. A data center server’s slow performance can be fixed by throwing more servers at it—virtual or real. The computational power on a spacecraft remains static for the mission’s duration, and systems must be designed to dump any given task without warning. A database server won’t melt down and bring adjoining racks with it if it can’t insert a row in real time, but a craft hurtling toward Mars might miss the planet completely if its cycles aren’t perfectly managed.

A craft’s software is also made more resilient by doubling—or quadrupling—down on replication and physical backups. For NASA’s Space Shuttle Program, which ran from 1972 to 2011, three or four computers weren’t enough: Shuttles had five flight computers, and planners considered a sixth. “Once you get humans on board, you’re in a whole different game,” Garlan explains: The tolerance for risk is minuscule.

Though running identical software on multiple computer systems is the name of the software-architecture game across crewed and uncrewed missions, satellites, probes, landers, and rovers, different missions take different approaches to error detection, recovery, and updates. NASA’s obsessive focus on software testing to find and remove bugs, plus a strategy that allows software to recover in the worst of circumstances, is one approach that has paid off repeatedly. First implemented midway through the Apollo program, which ran from 1961 to 1972, the strategy was designed explicitly for when things go wrong. Without it, mission after mission would have had to be abandoned or would have achieved only a fraction of its goals.

During its 1977 launch, for example, the NASA space probe Voyager 2 couldn’t interpret all the shaking it recorded; mission scientists hadn’t anticipated how its sensors would read that activity. But the probe correctly knocked itself into a recovery mode and restored itself. This provided insight that allowed scientists to update the code for its sibling, Voyager 1, in time for its reverse-order launch 16 days later. Apollo 11 would have had to abort the first moon landing had its software not been designed for instant, continuous recovery. The Mars rover Opportunity would have seen a premature end to its much-extended life if a cable short hadn’t been bypassed by rewriting which measurements were gathered for movement. (Launched in 2003, Opportunity was initially supposed to stay active for 90 Martian days; it remained operational until mid-2018, for a total of 5,111 Martian days.) The Mars rover Curiosity might have failed during its first and fifth years on the planet after experiencing glitches in its two main computers. (Launched in 2011, Curiosity was initially supposed to stay active for 687 Earth days; it remains active today at over 2,700 days and counting as of this writing.)

Sometimes redundancies also add opportunities. We would have far fewer pictures from Voyager 2’s pass by Uranus, in 1986, and Neptune, in 1989, without a duplicate set of computers standing by, coupled with mission control’s ability to upload new software to take advantage of them. The inertial measurement units of the Mars Odyssey, launched in 2001, and the Mars Reconnaissance Orbiter, launched in 2005—which allow the two satellites to determine their absolute position in the universe while also noting changes in rotation, orbital speed, and other parameters—will reach the end of their useful lives in the coming months (or years). Currently, they help the orbiters remain in the right spot and altitude above Mars, and keep their antennae pointed back at Earth. If these sensors failed, it would mean an end to the orbiters’ ability to collect and transmit data back to Earth—were it not for flexibility built into the software. Scientists are testing new software that will eventually allow both satellites to determine their position with a star-tracking camera on board each craft, affording them additional years of usable life.

Expect the unexpected

Anyone who has managed physical or virtual servers under load understands the trade-offs between keeping a handful of critical machines running with a few minutes of downtime a year and having failover solutions that allow any link in a chain or any parallel task to have its slack picked up by another system. It usually boils down to cost and criticality: Can you afford to have that extra capacity? What’s the worst thing that happens if it fails briefly? A company’s website typically doesn’t freeze forever or crash into a planet if it’s down for a few minutes.

Still, earthbound computing hardware has evolved from monolithic business mainframes to redundant arrays of powerful servers that allow for the failure of one or more of them without breaking the business. NASA presaged this move decades before the rest of humanity out of necessity: In space, having more computers running duplicate functions, capable of weathering catastrophe, worked far better than having a monolithic system for which failure wasn’t a possibility. This has also proved, over time, to be the right course. Perfect software, perfect hardware will still break in space—a backup (or two or three or five) defeats Murphy’s Law and cosmic rays.

But it wasn’t always that way. In the 1988 book Computers in Spaceflight, commissioned by NASA, author James E. Tomayko notes that during the Apollo program, the agency focused on ensuring every component and system was tested until it was determined to be perfect. This, however, resulted in a process that was both expensive and brittle.

Later missions introduced a variety of architectures, which were still subjected to relentless testing to eliminate bugs but which allowed for a recovery or standby mode when failures occurred for any number of reasons, such as a hardware module breaking in flight or radiation damage.

A pair of famous examples illustrates the problem with relying on perfection. Software engineer Margaret Hamilton, the director of the MIT group that developed Apollo’s software in the late 1960s, had a hand in both. Hamilton frequently brought her daughter Lauren, then a toddler, to the office on late nights and weekends. Before the Apollo 8 mission in late 1968, which would mark the first time astronauts circled the moon, Lauren was playing with the command module simulator via a DSKY, a keyboard and display combination. She managed to crash a flight simulation by unexpectedly triggering a prelaunch sequence.

Hamilton tried to get NASA to let her introduce error checking to prevent an astronaut from making the same mistake during the mission, however unlikely. NASA overruled her, insisting astronauts would perform the task perfectly. Hamilton was reduced to putting a note in the manual about the possibility of this problem.

Then astronaut Jim Lovell selected the same sequence during Apollo 8’s flight, purging from the craft’s memory the navigation data that was required to return to Earth. After a scramble that was less telegenic than the Apollo 13 “Houston, we have a problem” scenario—more like finding a backup tape than hopping aboard a makeshift lifeboat—Hamilton and her team were able to transmit the navigation data from Earth, as the system was flexible enough to accept those inputs in transit.

This may have affected the design that led to—and recovered from—a failure during the first moon landing, on July 20, 1969. A documentation error caused astronauts to leave a radar switch on, which generated a stream of data that the lander’s computer had to process while it was engaged in the complicated task of setting down. The system produced an error that the astronauts relayed to Earth: “1202.” (Yes, obscure error codes date way back, too.)

Hamilton had designed the system to be resilient, recovering without interruption if conditions arose that overwhelmed it, and allowing it to report errors with sufficient information to make judgment calls. In this case, the computer’s load-management software shed lower-priority work—including processing the extraneous radar data—so that higher-priority tasks kept running, and it performed just as expected. After a tense and rapid set of consultations at mission control, the lunar lander astronauts were given the go-ahead, and they touched down with only seconds of fuel to spare.

“Our software saved the mission because it was asynchronous—it bumped low-priority tasks,” Hamilton told Air & Space magazine in 1994. “Without it, the mission would have aborted or crashed on the moon.”
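What Hamilton describes boils down to a priority-driven executive that sheds work when demand exceeds capacity. The sketch below is a minimal illustration of that idea in modern Python—not Apollo Guidance Computer code—and its task names, priorities, and cycle budget are invented for the example: jobs are admitted in priority order, and whatever doesn’t fit in the budget is bumped rather than allowed to stall critical work.

```python
# Minimal sketch of priority-based load shedding (illustrative; not AGC code).
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    priority: int   # lower number = more critical
    cost: int       # cycles the task needs in this scheduling period

def schedule(tasks: list[Task], capacity: int) -> tuple[list[Task], list[Task]]:
    """Admit tasks in priority order until the cycle budget is spent."""
    admitted, bumped = [], []
    remaining = capacity
    for task in sorted(tasks, key=lambda t: t.priority):
        if task.cost <= remaining:
            admitted.append(task)
            remaining -= task.cost
        else:
            bumped.append(task)   # low-priority work is dropped, not queued forever
    return admitted, bumped

if __name__ == "__main__":
    period = [
        Task("descent guidance", priority=0, cost=50),
        Task("crew display update", priority=2, cost=30),
        Task("spurious radar counters", priority=3, cost=40),
    ]
    run, dropped = schedule(period, capacity=90)
    print("run:", [t.name for t in run])          # the critical tasks always fit
    print("dropped:", [t.name for t in dropped])  # overload sheds the rest
```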

Even though the software design saved the day, it required human intervention. That intervention is rarely feasible for split-second decisions that could mean the difference between aborting an action and ending a mission, including the potential for loss of life.

NASA conceived the Space Shuttle Program before the Apollo missions wound down as a way of providing reusable launch vehicles that could spend time in orbit, carry substantial cargo, accomplish scientific research, and ferry building materials, supplies, and passengers to a permanent orbital station. (The shuttle aided the International Space Station’s construction.)

The reusable part was key, and it meant that the shuttle had to be designed to cope with both fast turnarounds and more potential for failure over time as it aged, no matter how well the maintenance was performed. As Tomayko wrote in Computers in Spaceflight, the shuttle was designed “to sustain mission success through several levels of component failure.”

Computers had become more powerful, and software engineering—a term Hamilton coined—for spaceflight had matured over several years. As a result, instead of thinking about a main system and a backup, or even a couple of backups, the shuttle’s design relied on four independent computers running identical navigation and guidance software and receiving the same inputs.

The four computers functioned as a democracy in miniature. Three of them had to agree on what they measured in order for an action to proceed. If three agreed and the fourth balked, the astronauts would turn it off or restart it. This allowed the quick decision-making required to avoid catastrophe or an expensive halt.

If multiple computers failed or couldn’t agree, an extra computer that also had access to the shuttle’s controls could take over. It would allow for a preprogrammed rough (but safe) ascent, abort, and reentry.
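The voting arrangement itself is simple to model. The following Python sketch is purely illustrative—the computer names, the agreement tolerance, and the sample outputs are assumptions, not shuttle flight software—but it captures the rule: accept a result only when at least three machines agree, flag any dissenter, and fall back to the backup computer when no majority exists.

```python
# Illustrative majority-voting sketch (not shuttle flight software).
from collections import Counter

def vote(results: dict[str, float], tolerance: float = 1e-6):
    """Return the agreed-upon value and any dissenting computers."""
    def key(v: float) -> int:
        return round(v / tolerance)              # bucket values that agree within tolerance
    buckets = Counter(key(v) for v in results.values())
    winner, votes = buckets.most_common(1)[0]
    if votes < 3:                                # fewer than three machines agree:
        return None, sorted(results)             # defer to the backup flight computer
    agreed = next(v for v in results.values() if key(v) == winner)
    dissenters = [name for name, v in results.items() if key(v) != winner]
    return agreed, dissenters

outputs = {"GPC-1": 101.25, "GPC-2": 101.25, "GPC-3": 101.25, "GPC-4": 98.0}
value, suspects = vote(outputs)
print(value, suspects)   # 101.25 ['GPC-4'] -> the crew powers off or restarts the outlier
```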

The first time this system triggered was during an early test, when the shuttle Enterprise was being carried up on a Boeing 747 for a landing test. One computer failed just after the shuttle was released from the plane; the voting system worked and the landing succeeded. “That incident did a lot to convince the astronaut pilots of the viability of the concept,” wrote Tomayko.

Garlan advocates for more autonomy for ever more complicated spacecraft systems. Among systems engineers and architects who, like himself, are developing software for future missions, there are “a lot of discussions about what sorts of levels of onboard autonomy are needed. You can’t afford to shut down and go home.”

A redundancy of redundancies

While the space shuttle systems were being designed in the 1970s, a more immediate mission was already underway, and it needed a different kind of resiliency: the Voyager missions to the outer planets. In 1964, a Jet Propulsion Lab (JPL) scientist, Gary Flandro, calculated that by the late 1970s, Jupiter, Saturn, Uranus, and Neptune would be in the right alignment to allow a probe to visit each planet by using gravity assists, swinging around a planet for acceleration. Such an alignment happens once every 175 years.

Many aspects of the Voyager mission were redundant, starting with the fact that two probes were sent out and launched at different points. (The original plan had been to launch more probes; the Nixon administration slashed the budget in light of economic woes.) The probes received a last-minute shielding upgrade for their electronics after project scientists analyzed results from the Pioneer 10 and Pioneer 11 missions past Jupiter.

Voyagers 1 and 2 were each equipped with two parallel sets of computer systems, as were later missions. Were a failure or crash to occur in one of the three “A” computers—one each for command, data management, and attitude control—the probes could switch over automatically or by command to a “B” system. The command computers were even designed so that if portions of A or B failed, mission managers could swap over just that component.
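One way to picture that arrangement: every subsystem carries its own standby twin, and a fault swaps only the affected component rather than failing over the entire spacecraft. The Python below is a schematic sketch under those assumptions—the subsystem names and fault handling are simplified, and it is not Voyager flight code.

```python
# Schematic A/B redundancy per subsystem (illustrative; not Voyager flight code).
class RedundantUnit:
    def __init__(self, name: str):
        self.name = name
        self.active = "A"                   # the side currently in use
        self.healthy = {"A": True, "B": True}

    def report_fault(self, side: str) -> None:
        self.healthy[side] = False
        if side == self.active:
            self.switch()

    def switch(self) -> None:
        standby = "B" if self.active == "A" else "A"
        if not self.healthy[standby]:
            raise RuntimeError(f"{self.name}: no healthy side remaining")
        self.active = standby               # swap only this component

subsystems = {n: RedundantUnit(n) for n in ("command", "data", "attitude")}
subsystems["command"].report_fault("A")     # only the command computer moves to B
print({n: u.active for n, u in subsystems.items()})
# {'command': 'B', 'data': 'A', 'attitude': 'A'}
```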

The Voyager probes were also the first NASA spacecraft to use software to detect and automatically recover from faults. They could interpret a variety of events and react accordingly without instructions—although that independence led to a panic during the launch of Voyager 2. As John Casani, the mission’s prelaunch project manager, noted in an interview in the book Voyager Tales: Personal Views of the Grand Tour, “the launch vehicle was turning at a much faster rate than the spacecraft would ever turn in space.” However, Voyager 2 correctly recognized that it didn’t know what was going on and put itself into a safe mode.

“The spacecraft recovered itself after it was separated from the launch vehicle,” even before the ground crew figured out what had happened, Casani said. The project staff subsequently updated Voyager 1’s software before launch, and the probe managed to process the spinning and juddering properly.

While the computer systems were designed for redundancy, running the same software in parallel to allow a rapid switchover, the command systems could also be set during less important phases to run different programs or run a single task together. During critical moments, like the flybys of Jupiter and Saturn, the systems were running in tandem. NASA swapped out the control and flight software around 18 times during the Jupiter flyby—a remarkable (and planned) long-distance system-management feat, and one that absolutely required a backup, even though everything executed close to plan.

The primary mission for both Voyager probes was to study Jupiter and Saturn, but the extra capacity let Voyager 2 proceed to Uranus and Neptune. But there was a problem: At Uranus’s distance, the already low data-transmission rate at Saturn would slow to a tiny trickle, and by Neptune it would be even worse. Some improvements could be made on Earth by increasing antenna size and grouping multiple antennae together. NASA was also able to update the onboard software to shift data transmission to a more efficient but experimental error-encoding hardware device. (The encoder was so advanced, there was no decoder yet built on Earth when the Voyagers were launched with one encoder each.)

These improvements helped, but the backup command computer kicked in as well by doing double duty. As Voyager 2 snapped photos, one of the command machines ran the mission while the other squeezed the pictures down using an early image-compression algorithm. These rare outer-planet flybys produced much more information as a result.

The Voyager probes have left the confines of the sun’s magnetic bubble—not the solar system, just its so-called heliosphere—and now swim in the interstellar medium. But even at that distance, they’re sending back data and can still surprise the small team that continues to manage them. In 2010, Voyager 2 began sending back gibberish instead of scientific data. Scientists shifted the probe into a standby mode, the code for which was shaped by decades of refinement, while they figured out what had gone wrong. This included having the current program dumped back home at the 160 bits-per-second data rate the Voyagers now use.

It turned out that a single bit of memory in the decades-old flight data computer had switched from 0 to 1 unexpectedly, likely due to stray radiation. After confirming the error, JPL was able to switch Voyager 2 back into “science data mode.”
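The diagnosis relies on having a trusted reference: compare the memory image the spacecraft dumps back to Earth against the copy kept on the ground and XOR the words to locate any bit that has drifted. The Python below is a schematic sketch of that comparison, with an invented word size and sample values—it is not JPL’s actual tooling.

```python
# Schematic memory-dump comparison to locate flipped bits (illustrative only).
def find_bit_flips(reference: list[int], dumped: list[int], word_bits: int = 16):
    flips = []
    for address, (expected, actual) in enumerate(zip(reference, dumped)):
        diff = expected ^ actual                 # XOR exposes any differing bits
        for bit in range(word_bits):
            if diff & (1 << bit):
                flips.append((address, bit))
    return flips

ground_copy = [0x1A2B, 0x0F0F, 0x7E00]
telemetry   = [0x1A2B, 0x0F0E, 0x7E00]          # one bit has drifted in word 1
print(find_bit_flips(ground_copy, telemetry))   # [(1, 0)]
```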

The future of redundancy

While this kind of redundancy seems like the right choice for missions where hardware upgrades are impossible—especially when paired with flexible systems that allow software updates and enhancements—it can lead to related woes. In a paper titled “The Role of Software in Spacecraft Accidents,” Professor Nancy G. Leveson of MIT’s Aeronautics and Astronautics Department wrote:

A NASA study of an experimental aircraft with two versions of the control system found that all of the software problems occurring during flight testing resulted from errors in the redundancy management system and not in the control software itself, which worked perfectly. We need to devise protection features for software that reflect the “failure” modes of software and not those of hardware.

It’s ironic that redundancy would be the cause of an error instead of the solution.

A separate but significant problem is that modern spacecraft systems are vastly more complicated than their predecessors. With more computational power available and far more tasks that can be handled during missions, the code that drives the vessels of today is incredibly complex: It’s inevitable that there will be errors too complicated for duplicate (or quadruplicate) computers to solve. The same error can occur simultaneously on all systems with the same input, making a failure across multiple computers all at once a real possibility.

This can be exacerbated by the lack of a standard software architecture, something Garlan has advocated for during more than a decade of teaching an annual course on the subject at JPL. He says NASA systems often rely on a lot of rules instead of on global visibility across the system.

“They would prefer to take the thing before [and] tinker with it so it works for the new thing,” Garlan says—a combination of an appropriate aversion to risk and a problem with moving forward on well-understood, modern principles that reduce complexity.

That means that even the most advanced spacecraft being planned today relies in part on older notions of software architecture, albeit with a twist. The Orion spacecraft, designed to carry astronauts to the moon in a future crewed mission, will carry four computers, each with two processors working in parallel, the results of which have to agree. Each computer’s software will behave as if it’s independently flying the vehicle.

Instead of a democracy, these computers are solipsists, each believing it is the sole inhabitant of its universe. If one computer fails to provide the right instructions at the right time, the systems are designed to accept commands from the next while the failing system is rebooted. Orion will also have a backup flight computer, as with the space shuttles.
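A rough way to picture the handoff: each computer independently computes the next flight command, the vehicle takes the answer from the first machine that delivers one, and any machine that fails to answer is sent off to reboot. The Python sketch below is purely illustrative—its class names and command values are assumptions, and a None return simply stands in for a missed deadline; it is not Orion’s software.

```python
# Illustrative failover chain of self-sufficient flight computers (not Orion code).
from typing import Callable, Optional

class FlightComputer:
    def __init__(self, name: str, compute: Callable[[], Optional[float]]):
        self.name = name
        self.compute = compute        # each computer "flies the vehicle" on its own
        self.rebooting = False

def next_command(computers: list[FlightComputer]) -> float:
    for fc in computers:
        if fc.rebooting:
            continue
        command = fc.compute()        # None models a missed deadline or bad output
        if command is not None:
            return command
        fc.rebooting = True           # the failing machine restarts; try the next
    raise RuntimeError("no flight computer produced a command")

computers = [
    FlightComputer("FC-1", lambda: None),   # FC-1 hangs this cycle
    FlightComputer("FC-2", lambda: 3.25),   # FC-2's answer is used instead
    FlightComputer("FC-3", lambda: 3.25),
]
print(next_command(computers))              # 3.25, while FC-1 is marked for reboot
```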

It’s been a long time since humanity last visited the moon, a celestial body of which Earth has only one. A safe return requires a fivefold effort.

About the author

Glenn Fleishman writes about the price of type in 19th-century America, bitcoin, and nanosatellites. He is currently completing a collection of 100 sets of printing and type artifacts for the Tiny Type Museum and Time Capsule project.

@GlennF

Artwork by

Kim Salt

kimsalt.com
