A bad plan burns out at the first sign of trouble. A good plan keeps burning.
Five years ago, my employer, an ed-tech provider for primary schools, was struggling under the weight of its own success. One of our core maintenance tasks was a semi-regular, totally manual configuration management process for each customer. As our user base grew, our support team needed to spend more and more time on this flow. Eventually, they realized they’d need to hire someone to manage the process full-time.
Why not automate the dang thing? This was a question we’d asked ourselves many times, but company leadership always decided it was too dangerous. The flow was an asynchronous mess of business rules that involved coordinating several third-party systems and directly modified our customers’ IT networks. If we messed up, clients would cancel contracts. As a result, the company considered the manual process safer.
But the workload quickly became unsustainable. We soon learned that “we’ll need a new full-time hire” actually meant “we’ll need a new full-time hire every year.” And if our expansion plans took off, we’d need two hires per year. It was time to automate.
The project manager identified three properties the new system would have to maintain for it to safely replace our manual process. With those requirements in hand, I started planning out the system—in a rather unconventional way.
Earlier that year, as part of a batch uploader project, I’d tried out a planning technique called formal specification, which let our team identify critical bugs in the architecture and shaved a week off delivery. (I wrote about it in depth in a 2017 Medium post.) I decided to apply the same techniques to this project. Complicated systems often have subtle, byzantine bugs, and formal specs would offer a way to identify them ahead of implementation. It was time to set some plans alight.
Formal specifications as flares
Much of software engineering entails grappling with unknowns and finding your way in the dark. Imagine you’re walking down a hill at night: You know there are branches, brambles, and boulders strewn along the path, and because your flashlight can only illuminate what’s directly in front of you, it doesn’t make sense to plan your entire route. Instead, you’ll have to take a few shuffling steps, reorient, and try not to trip.
Alternatively, you could light a flare and roll it down the hill. As it ignites, the flare illuminates hidden obstacles that might have tripped you up. The hill becomes a little clearer, the route less obscure. You’ll still have to apply some guesswork as you adjust and improvise your path, but you’ll have a better sense of what to watch out for.
Good plans require that flare. They not only light the path forward, but also show you where not to tread, making the consequences of wrong paths clearer. While they won’t reveal every potential pitfall, they can provide that extra bit of brilliance that keeps you from tripping.
In software engineering, these flares take the form of formal specifications: blueprints that describe a system design so it can be analyzed and verified. Often, we spend the bulk of our coding time tinkering with plumbing and fiddly bits; the essential behavior we’re trying to program is just a small part of the equation. Making sure the essential behavior is correct before you start a project can save a lot of time and energy down the road. To do this, we can write a simulation of what the completed system will look like—a specification—and run tests on it. It won’t be a flawless representation of the code, but the problems you observe in the spec will most likely be problems in the final system. Better to illuminate them early.
Specification in action
Typically, specifications are written in formal specification languages such as TLA+ or Alloy. While programming languages are designed to be run, specification languages are designed to be simulated, and therefore include features like nondeterminism (this call can succeed or fail), atomicity (the server can crash between these two calls), and scalability (10 servers will be trying to make this call). Specification languages can also delineate the parts of the spec you’ve carefully thought through, as well as the parts you’ve shunted to the side for the time being.
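For instance, nondeterminism can be written directly in PlusCal, the algorithm language that compiles to TLA+. The snippet below is only an illustrative sketch; the response variable and both branch outcomes are hypothetical:
\* the model checker will explore both branches
either { response := "ok"; }
or     { response := "error"; };
When the spec is checked, both executions get explored—the one where the call succeeds and the one where it fails—without you having to enumerate them by hand.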
Here’s a simplified PlusCal snippet representing two workers concurrently trying to reserve a resource:
\* definitions and variables elided
process (worker \in 1..2)
variable resource; {
  Atomic_Action_1: resource := next_unused_resource;
  Atomic_Action_2: reserve(resource, self);
}
You might see the problem here: In the time between when the first worker retrieves the next unused resource and when it reserves it, the second worker could retrieve the same resource—a race condition. The issue is easy to identify in two lines of spec; it’s a teensy bit harder when those two actions are separated by hundreds of lines of logging, API shims, abstraction, and cleanup. We can spot trouble more easily in the plan than in the implementation.
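In PlusCal, each label marks one atomic step, so one way to close this particular race (a sketch, reusing the same elided definitions as above) is to put both statements under a single label:
\* both statements now execute as one atomic step
process (worker \in 1..2)
variable resource; {
  Reserve_Atomically:
    resource := next_unused_resource;
    reserve(resource, self);
}
Whether the implementation can actually make that step atomic, with a lock or a transaction, is a separate design question, but the spec states the requirement explicitly.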
If you didn’t catch that bug, don’t fret—you’re not a computer that can simulate every possible concurrent execution. Fortunately, you don’t have to be. Since the spec is simpler and higher-level than the code would be, we can use a model checker—a program that takes a spec and exhaustively simulates every possible scenario that could occur—to more thoroughly test it. If it finds an issue, the model checker provides an error trace that shows exactly how to reproduce it.
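Concretely, you hand the model checker a property that must hold in every reachable state. Here is a minimal sketch, assuming a hypothetical reservations variable that holds <<worker, resource>> pairs:
\* invariant for the model checker: no resource is ever
\* held by two different workers at once
NoDoubleReservation ==
  \A p1, p2 \in reservations :
    (p1[2] = p2[2]) => (p1[1] = p2[1])
Run against the buggy spec above, the checker would flag the state where both workers hold the same resource and trace the interleaving that produced it.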
Sometimes the problem is a simple mistake, like putting a lock in the wrong place or forgetting to handle an empty input. But other times, the error trace reveals an unknown: maybe two operations you assumed were unrelated need to be ordered, or a third-party service is unreliable, or there’s an extra requirement to stay compatible with a legacy system. The spec—your flare—allows you to spot these snags and start over, coming up with a new plan to feed the model checker. Sure, it might be annoying to throw out a mostly working plan, but it’s far worse to throw out a mostly working codebase.
Eventually, through this process, you’ll converge on something that works—the smoothest path down the hill. You’ll still need to make adjustments, deviating when you encounter an unexpected crag or bluff. But you’ll be able to move much faster than if you didn’t have a plan at all.
The right course
Sometimes, however, you won’t converge on something that works. This brings me back to the configuration management project I mentioned earlier. After two days of modeling, I had a lot of designs that wouldn’t work, but nothing that would. Some would break immediately; others had subtle design flaws that could go unnoticed for months in production before causing the system to collapse. Many could guarantee two of the requirements, but none could guarantee all three.
I’d worked on projects like this before—ones where we couldn’t satisfy all of the requirements and found ourselves tumbling down that hill. We’d hit one edge case in production and suddenly our putatively stable system would come crashing down. We’d deploy frantic fixes and cross our fingers, only to discover the fix had poked a hole in another requirement. But this time, with our spec in hand, I could survey the terrain and say, “Something’s wrong.”
There was nothing to do but drag the project manager over, show them how every obvious and less-obvious design would fail in a different and spectacular way, and ask which of the three properties we could jettison to make the system work. Soon after, our in-house design experts determined that one requirement didn’t need to always hold, as long as it would usually hold. Knowing this, we were able to narrow the project down to just a few designs—all of which, the model checker confirmed, would work.
The formal specifications had done their job, illuminating several possible paths down the hill. Now it was time to apply a little good old-fashioned reasoning. After some discussion with the project manager about our timeline and scope, we settled on a single design and got to work.
Steady as we go
From that point on, the project went smoothly. The specification process had pointed out numerous obstacles in our way, which kept us from tripping. Even when we encountered unknown unknowns during implementation, the specs helped us react and reorient. Often, for example, the seemingly obvious solution would have turned the design into one we already knew wouldn’t work. (No sense in seeing a hole and jumping over it, only to land in a different one!) The automated configuration management service ran efficiently for years afterward, well beyond the tenures of the engineers involved.
I came away from the experience wondering why formal specification wasn’t more widespread. It had proved valuable at our startup, which had just 10 engineers. How much might it help development teams building software at scale? Through research, I discovered that UX issues are the primary barrier: The tools are hard to learn and hard to use, and their benefits might not be immediately obvious to users. These are genuine problems, but addressable ones. UI and documentation can bridge the gap between potential and practice—I’ve since dedicated my career to helping developers learn formal methods, writing books on the subject, and spearheading documentation projects, including for the Alloy language.
These days, I’m a full-time consultant in formal specification and an advocate for the process’s ability to improve our work as software engineers. When we can see which designs will function—and, more importantly, which ones won’t—before we start developing, we can build systems more cleanly and quickly. Formal specifications allow us to better appraise the terrain ahead, from the smoothest paths to the rockiest ones. Let’s treat our plans as flares, and let them light our way.