Imagine, for a moment, that you are visiting a small village on the tropical island of San Serriffe. At the hotel counter the clerk is describing the local attractions: orchards to the south, a beach to the east, the Haunted Forest to the north, and a lake to the west. “The fishing this time of year is especially—”
You interrupt them. “Wait, what?”
“Oh, yes, the Haunted Forest. Bit of a nuisance, really. Can’t stay out past sundown or the ghosts will get you. Nothing we can do, it’s been around for too long to fix now.”
As a traveler, this would bother you. At least, I hope it would bother you! But as engineers we accept this sort of thing daily. Our industry’s favorite euphemism is “tech debt,” wherein the team takes on the recurring obligation of maintenance work in exchange for quick progress on an important project. But this use of “debt” is misleading: Bad code doesn’t have predictable fixed costs. It impedes all nearby work, traps the unwary, discourages the inexperienced, and exhausts the veterans. Logs from files deleted last year, documentation for features that were never written, tests that fail at midnight. Bad code is haunted, and a sufficiently large thicket is a haunted forest.
Any team would agree that preventing haunted forests is important, but there is less consensus about what to do when one is discovered. Healthy engineering organizations take vigorous action to detect, isolate, and replace code that’s become haunted. Otherwise the forest grows stranger and spookier, and the cost of exorcising it can balloon beyond the business value of the entire project.
Identifying a haunted forest
For the good are good simply, but the bad are bad in every sort of way. — Aristotle
Not all intimidating or unmaintained systems are haunted. Newcomers may find it difficult to navigate a codebase full of subtle intended behaviors; a stable implementation of some RFC might remain unchanged through a decade of shifting fashion. When deciding whether a codebase is unsalvageable, look at the relationship between the code and the engineers who work with it.
Some rules of thumb to identify code worthy of a complete rewrite:
Nobody understands how the system should behave. Not knowing what it currently does is materially different from not knowing what it should do. The former is amenable to standard testing and refactoring processes; the latter can only be solved by redesigning from core principles. After a new design has been completed, it may turn out that some parts of the old code (e.g., test cases or UI components) are still useful and can be salvaged.
It’s obvious that the current implementation isn’t acceptable. Look for systems that even experienced contributors dislike working on, or that have a history of burning out people assigned to them. Give more weight to the opinions of direct contributors—you’ll sometimes hear objections to a rewrite from people who haven’t worked directly on the bad code but have opinions about it anyway. Let them know that you can arrange for their temporary rotation into the role of Haunted Forest Ranger.
The system’s missing features or erroneous behaviors are impacting other teams. This can be subtle. Look for signs that other teams are aware of and working around problems caused by the system in question. This might manifest as observation diaries containing notes on what causes the system to behave incorrectly, or as other tools that are handling responsibilities beyond their scope to avoid a dependency. In an organization with a healthy attitude toward cross-team collaboration, these indicate that other teams were unable to fix the problem despite high motivation.
A competent engineer has tried, and failed, to improve the existing code. Every engineering org has members who are drawn to broken things and will try to fix them. A haunted forest’s revision history will often show signs of their activity—evidence of exploratory fixes and swift retreat once the full scope of the problem revealed itself. Look for projects where otherwise successful people have been unable to make headway for compelling technical reasons.
The codebase is resistant to automated tooling. Static analysis, unit testing, and interactive debuggers are high-leverage tools for working within a healthy codebase. If the structure of the project prevents their use, it can be difficult to land meaningful improvements even in the absence of other problems. Large codebases written in dynamically typed languages are especially prone to this issue, with metaprogramming like __getattr__ or method_missing being substitutes for and inhibitors of sustainable development practices.
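To make the tooling point concrete, here is a minimal sketch of how a __getattr__ hook hides a project's structure from its tools. The class and setting names (DynamicConfig, ExplicitConfig, timeout_seconds) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


class DynamicConfig:
    """Hypothetical settings object that resolves attributes at runtime."""

    def __init__(self, values):
        self._values = dict(values)

    def __getattr__(self, name):
        # Only called when normal lookup fails, so the set of valid
        # attributes exists nowhere in the source for a tool to inspect.
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name) from None


@dataclass(frozen=True)
class ExplicitConfig:
    """The same settings, declared where tools and readers can see them."""

    timeout_seconds: int
    retries: int


def read_misspelled(config):
    """Read a misspelled setting; return None if the lookup fails at runtime."""
    try:
        return config.timeout_secs  # typo for timeout_seconds
    except AttributeError:
        return None


dynamic = DynamicConfig({"timeout_seconds": 30, "retries": 3})
explicit = ExplicitConfig(timeout_seconds=30, retries=3)
```

Both objects fail the misspelled lookup only when the line actually runs. The difference is that a type checker such as Mypy should be able to reject a direct `explicit.timeout_secs` access ahead of time, whereas it has to accept any attribute name on the class that defines __getattr__.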
Negotiating a fear of the dark
Avoiding danger is no safer in the long run than outright exposure. The fearful are caught as often as the bold. — Helen Keller
Fresh graduates often push for a rewrite at the first sign of complexity because they’ve spent the last four years in an environment where codebase lifetimes are measured in weeks. After their first unsuccessful rewrite they will evolve into junior engineers, repeating the parable of Chesterton’s fence and linking to that old Joel Spolsky article about Netscape.
Be careful not to confuse their reactive anti-rewrite sentiments with true objections to this particular rewrite. Remind them that Spolsky’s article, “Things You Should Never Do, Part I,” was written in early 2000, four years before the release of Firefox (a rewrite of Netscape) and eight years before Chrome (a rewrite of KHTML). The true moral of this story is that rewrites are a good idea—if the new version will be better.
Documenting the current behavior of the old system is good because it helps inform the design of the new system and identifies potential points of interaction with users. But be careful not to spend too much time documenting the old system’s problems. This isn’t an exercise in surveying—time spent writing lots of docs about weird behavior is unhelpful. However, changing the behavior to be non-weird is very helpful.
Clearing the dead wood
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system. — John Gall
Rewriting an existing codebase should be considered a special case of a migration. Don’t try to replace the whole thing at once. Instead, follow the four basic principles of any big migration: identify how users interact with the existing system, insert strong API boundaries between subsystems, make changes intentionally, and work incrementally.
User interaction will make or break your rewrite. You must understand what the touchpoints are for users of the existing system so that you can maintain UI compatibility throughout the migration. These might be minor (wrappers around an existing CLI) or major (direct access to the backend datastore). Try to batch up user-facing changes: A single clear cutover, with users in control of the timing, will go over better than drip-feeding workflow disruptions over weeks or months. If the user-facing changes are significant, see if you can arrange for separate opt-in and opt-out periods during which both interaction modes coexist.
Subsystem API boundaries ensure that there are no unexpected communication channels between the newly independent subsystems as you carve the old system into chunks. Be fairly strict about this: Run them in separate processes, separate machines, or whatever is needed to guarantee complete visibility and control over how data moves. Do this recursively until the components are small enough that rewriting each one from scratch is tedious instead of frightening.
Make changes intentionally so that you understand how the new system will respond, and why, for any given input. Keep a record of intentional deviations between the systems, and communicate them to users as you would any other backward-incompatible change. By this point you should have a good idea of what constitutes correct behavior. If there’s no single correct behavior, it’s fine to settle for “predictable,” or at least “deterministic.” You are guaranteed to discover that some of the old system’s obvious bugs are not obviously bugs to its users, and catching that early may be the difference between a small revert commit and an emergency rollback in production.
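One way to keep that record honest is to run the same inputs through both systems and compare the results against the list of deviations you have already signed off on. The following is a rough sketch under invented assumptions: old_normalize and new_normalize stand in for the two systems, and the INTENTIONAL table is a hypothetical format for the record, not a description of any real tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Deviation:
    input_value: str
    old_output: str
    new_output: str
    reason: Optional[str]  # None means nobody has signed off on this difference


def find_deviations(
    old: Callable[[str], str],
    new: Callable[[str], str],
    inputs: Iterable[str],
    intentional: Dict[str, str],
) -> List[Deviation]:
    """Run both systems on each input and report every difference."""
    deviations = []
    for value in inputs:
        old_output, new_output = old(value), new(value)
        if old_output != new_output:
            deviations.append(
                Deviation(value, old_output, new_output, intentional.get(value))
            )
    return deviations


def old_normalize(name: str) -> str:
    """Stand-in for the old system: lowercases but keeps stray whitespace."""
    return name.lower()


def new_normalize(name: str) -> str:
    """Stand-in for the new system: also trims surrounding whitespace."""
    return name.strip().lower()


# The record of deviations that were planned and communicated to users.
INTENTIONAL = {" Bob ": "new system trims surrounding whitespace"}

report = find_deviations(
    old_normalize, new_normalize, ["Alice", " Bob ", "carol\t"], INTENTIONAL
)
unexpected = [d for d in report if d.reason is None]
```

Here the trailing tab on "carol\t" shows up in `unexpected`: a difference that nobody decided on, found before any user depended on it.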
Work incrementally. A good rewrite is valid and fully functional at any given checkpoint, which may be commits or nightly builds or tagged releases. The important thing is that you never get into a state where you’re forced to roll back a functional part of the new system due to breakage in another part.
Preventing paranormal activity
I am now—what joy to hear it!—of the old magician rid; And henceforth shall every spirit do whate’er by me is bid. — Goethe, “The Sorcerer’s Apprentice”
Some languages and programming styles are easy to misuse. Unfortunately, these are also the most popular choices among less experienced developers, who may not realize what they’re getting into until their part of the project is well and truly cursed. Depending on the experience level of the team, it may be useful to use lint rules or code review to restrict access to dangerous knowledge. Summon these dark powers with care.
Consider nonlocal control flow. Reading, the most fundamental debugging technique available to software engineers, is unusable if the reader can’t tell how a code path is being invoked or what it’s invoking. In the old days we wrestled with setjmp, longjmp, and goto; contemporary languages’ trampolines and event loops pose similar challenges. Central callback dispatchers, including (but by no means limited to) Python’s Twisted and Ruby’s EventMachine, convert a static call graph into a sequence of unconnected function calls without meaningful stack traces.
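The loss of stack information is easy to reproduce with a toy dispatcher. This sketch does not use Twisted or EventMachine; the Dispatcher class and the handler names are hypothetical, and the queue is only a stand-in for a real event loop.

```python
import traceback
from collections import deque


class Dispatcher:
    """Hypothetical central dispatcher: callbacks are queued, then run later."""

    def __init__(self):
        self._queue = deque()

    def call_later(self, callback, *args):
        self._queue.append((callback, args))

    def run(self):
        while self._queue:
            callback, args = self._queue.popleft()
            callback(*args)


def current_call_chain():
    """Names of the functions on the stack, outermost first, minus this helper."""
    return [frame.name for frame in traceback.extract_stack()[:-1]]


def save_record(chains):
    # Record how this function was reached.
    chains.append(current_call_chain())


def handle_request_directly(chains):
    save_record(chains)


def handle_request_via_dispatcher(dispatcher, chains):
    # The caller returns before save_record ever runs.
    dispatcher.call_later(save_record, chains)


direct_chains, dispatched_chains = [], []
handle_request_directly(direct_chains)

dispatcher = Dispatcher()
handle_request_via_dispatcher(dispatcher, dispatched_chains)
dispatcher.run()
```

In the direct case the recorded chain ends with handle_request_directly followed by save_record. In the dispatched case it ends with run followed by save_record, and the function that scheduled the work appears nowhere on the stack.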
Mutable global variables, dynamic scoping, and other forms of hidden state can make clear-looking code do something totally unexpected. Superficial simplicity with hidden complexity is like catnip for junior developers, who value succinct code but haven’t yet been forced to debug someone else’s succinct code at 3 a.m. on a Sunday.
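A small illustration of the hidden-state problem follows. Everything in it is made up for the example: the _settings dictionary, the function names, and the storage.example host.

```python
# Hypothetical module-level settings, mutable by anyone who imports them.
_settings = {"region": "us-east"}


def set_default_region(region):
    _settings["region"] = region


def bucket_url(name):
    # Looks like a pure function of its argument, but also reads hidden state.
    return f"https://{_settings['region']}.storage.example/{name}"


def bucket_url_explicit(name, region):
    # Slightly longer at every call site; nothing is hidden from the reader.
    return f"https://{region}.storage.example/{name}"


before = bucket_url("logs")
set_default_region("eu-west")  # imagine this buried in an unrelated module
after = bucket_url("logs")
```

The two bucket_url("logs") calls are textually identical and return different URLs. The explicit version costs one more argument, and in exchange its result can be predicted from the call site alone.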
Dynamic types require careful and thoughtful programming practices to avoid turning into type soup. Tooling such as Mypy and Sorbet can help here, but introducing them into an existing haunted forest may be infeasible if the code doesn’t already have a consistent type model. Use them in the new codebase from the start, and they might be useful when reclaiming portions of the original work.
Distributed systems can become haunted forests through sheer size, once no single person is capable of understanding the entire API surface they provide. Note that microservices don’t automatically prevent this, because naively splitting up a monolith turns the entire internal structure into an API surface. Each of the above per-process issues has distributed analogues: for example, S3 is global mutable state and JSON-over-HTTP is dynamically typed. Similarly, a statically typed schema language such as Protobuf may be situationally useful but difficult to retrofit into an existing protocol.
Remembering forests past
Why all this guesswork? You can see what needs to be done. If you can see the road, follow it. — Marcus Aurelius
There are two large system rewrites that shaped how I think about the problem of, and solution to, haunted forests. One, during my last year at Google, provided both the framework and the name: John Reese’s “No Haunted Graveyards,” written in 2015, is an internal document that makes a case for thoroughly understanding existing systems. The other, during my first year at Stripe, was a chance to see if my approach to large rewrites could survive outside of Google’s carefully tended environment. I can’t claim a universal truth, but the fact that two dissimilar codebases faced the same basic problems and were responsive to the same interventions gives me some confidence that this model is broadly useful.
The first system was a data center capacity management tool which tracked allocation of compute resources (CPU, RAM, etc.) among product areas. One of the earliest design decisions was to distribute the core logic as a C++ library linked into every client, which made API changes impractically difficult. By the time I joined the original project it had been around for several years and had survived significant tidal shifts in scope and purpose, such that every user-facing operation had to filter through sedimentary layers of complex business logic.
The second system was a software deployment tool, similar in purpose to Spinnaker, that was originally written by an intern. It had endured largely unchanged as the company grew from a dozen engineers to several hundred, fending off all attempts to tame its custom and highly dynamic async callback dispatcher. My first project as a new employee was to add metrics and measure how long deployments took—but landing those metrics took a frustrating two weeks. It wasn’t long before I started drawing up plans for a more radical redesign, eventually convincing the team that a rewrite was desirable and feasible.
Both systems easily met all five criteria for a haunted forest. While the details differed greatly, there were a couple of design choices that in retrospect made improvement particularly difficult.
First, all interaction happened via a set of command-line tools that had dozens (Stripe) or hundreds (Google) of options, many of which were obsolete or irrelevant. These tools had been wrapped by scripts, web UIs, and automated workflows managed by teams across the company, so even minor changes in output formatting or error messages could break downstream users. The CLIs’ terminology and data models were from previous iterations of the system, but couldn’t be altered for compatibility reasons.
Second, the actual behavior—what changes any given operation would effect on the data store—was impossible to conceptualize or reason about. Even if given complete snapshots of the database before and after, we would have no way of knowing whether an operation had been performed correctly. The test suites validated fragile internal properties, such as the order of semantically unordered data, and avoided unit testing in favor of complex, minutes-long integration tests.
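To illustrate the kind of fragile internal property mentioned above, here is a hedged sketch of a check that pins the order of semantically unordered data, next to one that does not. The allocation data and function names are invented for the example and are not taken from either system.

```python
def teams_with_capacity_v1(allocations):
    """Original implementation: happens to return names in insertion order."""
    return [team for team, cpus in allocations.items() if cpus > 0]


def teams_with_capacity_v2(allocations):
    """Equivalent rewrite: same set of names, returned in sorted order."""
    return sorted(team for team, cpus in allocations.items() if cpus > 0)


def fragile_check(result):
    # Pins an ordering that no caller is supposed to depend on.
    return result == ["search", "ads"]


def robust_check(result):
    # Checks only what matters: which teams appear, each exactly once.
    return len(result) == 2 and set(result) == {"search", "ads"}


allocations = {"search": 40, "ads": 25, "maps": 0}  # hypothetical CPU counts

v1_result = teams_with_capacity_v1(allocations)
v2_result = teams_with_capacity_v2(allocations)
```

The fragile check accepts v1 and rejects v2 even though both return the same teams, so a suite built on checks like it fails on every harmless change and says little about whether an operation was performed correctly.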
Both rewrites were uneventful and took about a year to finish. Along the way, users saw minimal disruption to their day-to-day workflows, while the incremental process let us safely roll out sweeping changes without any hard cutovers. In some cases we were also able to fast-track the schedule of dependent projects by building support directly into the new API. In the end, once the last of the old ghosts were gone, the islanders could cross “haunted forest” off their map for good.