Software engineering is a team activity. Nearly all projects are maintained by multiple people, and projects increasingly depend on numerous other pieces of code, whether third-party libraries or modules maintained by separate teams or individuals within the same project. As such, a key challenge of writing and working on software is managing coordination and communication. How do you make sure everyone has the information they need to get their job done? How can you safely make changes that impact others’ workflows and code? Tests, test suites, and the practices we use to write and run them can be powerful and explicit tools in this quest.
Tests as documentation
Well-written test cases can often serve as direct examples of a tool’s API or usage. A test of a library’s public interface will consist of some number of calls into the library, followed by some amount of explanation (in the form of assertions) about the expected output or behavior. Given a good CI setup, tests are also guaranteed to be working examples, since they are machine-checked.
Since clear, working examples are a core element of effective documentation, this phenomenon is fortuitous: The process of writing a good test can give us a head start on our documentation. Languages and tools increasingly let you make this equivalence explicit: Python and Rust, for example, have “doctest” tools that let you write examples directly into your documentation, then execute them as tests alongside the rest of your test suite. Go similarly supports writing example functions that will be both rendered in API documentation and executed (and compared with their expected output) by go test.
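For instance, here is a minimal sketch of a Python doctest. The function and its values are invented for illustration, but the mechanism is the standard library’s doctest module: it parses the >>> example out of the docstring, runs it, and compares the printed output.

```python
def scale(point, factor):
    """Scale a 2-D point by a constant factor.

    The example below is documentation and a test at the same time:
    running `python -m doctest example.py` (or pytest with
    --doctest-modules) executes it and checks the output shown.

    >>> scale((2, 3), 10)
    (20, 30)
    """
    x, y = point
    return (x * factor, y * factor)
```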
Most people write examples for the documentation, then use doctest to help ensure their correctness. This is certainly a useful and productive practice. But what if we repurposed our test suite as documentation—writing tests in a special format, then folding them into the documentation for free?
What if we took this line of thought even further: Can we think of all of our tests, even the ones we don’t express as doctests, from the perspective of a documentation writer? This encourages us to write more tests in terms of the user-visible behavior of a system, and in terms of the abstractions and concepts the library exposes to its users, rather than the internal abstractions and implementation details of the tool.
As an example, suppose an engineer—let’s call her Alice—is implementing a graphics toolkit and wants to test that the code can compose multiple translations and rotations properly. If her library represents transformations as matrices, she might initially think to write tests that build up transformations and assert that the resulting matrix is correct. If she takes up a documentation perspective, though, she might reason that her users care more about the result of applying a transformation. In that case, she’d test what happens when these transformations are applied to some shape. In either case, she’ll be testing essentially the same behavior, but the latter will be more readable to a user or a new developer of the library.
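In code, the contrast between the two approaches might look something like the following sketch. The tiny Transform class here is invented for illustration; Alice’s real library would be far richer, but the shape of the two tests is the point.

```python
import math

class Transform:
    """A toy 2-D affine transform, stored as a 3x3 row-major matrix."""
    def __init__(self, matrix):
        self.matrix = matrix

    @classmethod
    def rotation(cls, radians):
        c, s = math.cos(radians), math.sin(radians)
        return cls([[c, -s, 0], [s, c, 0], [0, 0, 1]])

    @classmethod
    def translation(cls, dx, dy):
        return cls([[1, 0, dx], [0, 1, dy], [0, 0, 1]])

    def then(self, other):
        # Compose transforms so that `other` is applied after `self`.
        a, b = other.matrix, self.matrix
        return Transform([[sum(a[i][k] * b[k][j] for k in range(3))
                           for j in range(3)] for i in range(3)])

    def apply(self, point):
        x, y = point
        m = self.matrix
        return (m[0][0] * x + m[0][1] * y + m[0][2],
                m[1][0] * x + m[1][1] * y + m[1][2])

def close(a, b, eps=1e-9):
    return all(abs(x - y) < eps for x, y in zip(a, b))

# Implementation-centric test: asserts on the internal matrix entries.
def test_compose_produces_expected_matrix():
    t = Transform.rotation(math.pi / 2).then(Transform.translation(1, 0))
    expected = [[0, -1, 1], [1, 0, 0], [0, 0, 1]]
    assert all(close(row, want) for row, want in zip(t.matrix, expected))

# Documentation-centric test: asserts on what a user actually observes,
# namely where a point ends up once the composed transformation is applied.
def test_rotate_then_translate_moves_point():
    t = Transform.rotation(math.pi / 2).then(Transform.translation(1, 0))
    assert close(t.apply((1, 0)), (1, 1))
```

Both tests exercise the same composition logic, but the second reads like a statement about the library’s behavior rather than about its internals.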
Tests of any kind can also provide a lens into the author’s intent. While a determined user can try to empirically figure out the behavior of a library in an edge case (by trying it and seeing what happens), tests provide key context that a behavior is deliberate and contemplated (instead of just being an accident of implementation).
Tests as translation
Test cases can help document a tool’s implementation details. In effect, they translate between these details and external behavior.
Suppose Alice is implementing a library and discovers a subtle implementation detail: Some nonobvious edge case needs special handling in order to be correct. As a careful member of her team, she wants to record not just the correct implementation, but also her understanding of its subtlety and why it matters.
She could write a comment, or perhaps a commit message, but finds it challenging to describe the crux of the situation and its impact: The interaction is nonlocal; two distant parts of the system interact in a subtle way that must be handled in one place, even though it is really a consequence of the entire system’s design.
Instead, Alice decides to write a test case that fails if the detail is handled poorly and passes when handled correctly. Like all tests, this has the usual advantage of being executable and machine-checked: If a future refactor or modification results in handling the edge case improperly, the bug will hopefully be automatically noticed in CI, without requiring a developer to first internalize the edge case or its significance in the system.
However, and equally importantly, this test case can provide an explanation of the subtle implementation bug in terms of its observable impact on users. In particular, Alice can write the test so that a failure manifests as incorrect behavior in a user-visible way, not just as a numeric mismatch or an opaque “assertion failed” inside the implementation.
Returning to Alice’s graphics library, perhaps she has found some subtle floating-point precision bug in a low-level drawing operation. When she fixes it, she might write a test that demonstrates the bug not by checking for an incorrect numeric value, but by making a sequence of API calls that actually draw the wrong pixels because of the error. Armed with this test, a future developer can see this bug not merely as an obscure question of a few decimal places of precision, but as (literally) visibly incorrect behavior.
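A minimal, self-contained sketch of this idea follows. The little raster helpers are invented stand-ins for Alice’s real drawing API, and both tests fail against this deliberately buggy code; the difference is in what the failures tell a reader.

```python
def row_height(total, margin):
    # A plausible spot for floating-point error to creep in:
    # (1.0 - 0.9) * 10 evaluates to a value just shy of 1.0.
    return (total - margin) * 10

def draw_horizontal_line(width, height, y):
    # Rasterize a line at floating-point height y by truncating to a row index.
    grid = [[0] * width for _ in range(height)]
    grid[int(y)] = [1] * width  # truncation is where the precision bug bites
    return grid

# Numeric-style test: a failure reads as an opaque mismatch of decimal places.
def test_row_height_value():
    assert row_height(1.0, 0.9) == 1.0

# User-visible test: a failure reads as "the line landed on the wrong row,"
# which tells a future developer why the precision detail actually matters.
def test_line_is_drawn_on_row_one():
    grid = draw_horizontal_line(width=4, height=3, y=row_height(1.0, 0.9))
    assert grid[1] == [1, 1, 1, 1]
```

Both failures have the same underlying cause, but only the second one explains, in the user’s terms, what has gone wrong.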
This technique is invaluable for code reviewers, too. If Alice spots a subtle issue during code review, she can communicate it to the original author by writing a test case that demonstrates the problem. This makes the bug she’s identified unambiguous, and clarifies whether the author has fixed it. They need only rerun the test!
Tests as relationships
If Alice maintains a software library or tool with many users, the question of whether a change to her tool will break things for some of her users is a difficult (and inevitable) one. She could try limiting the surface area of her interface by carefully using language-level visibility declarations, but it’s well known that the implementation details of libraries and systems have a habit of leaking out. No matter how careful the original authors are, any observable behavior of a system will end up being depended on in some critical way by at least some downstream users—a phenomenon known as Hyrum’s law.
Unexpected breakage of downstream projects can seem hard, even impossible, to avoid, but test suites can provide a powerful tool for detecting problems early, as well as creating better feedback loops between Alice and her users. We presume Alice has a test suite of her own, which tests her tool as best it can. But while this suite will be helpful, experience suggests it’s hard or impossible for it to capture all of the behaviors Alice’s users rely on in practice. However, if Alice is able to easily run her users’ test suites against new versions of her tool, she gains not just additional test coverage, but also the ability to specifically test the parts of the tool her users are relying on. She can thus learn of potential breaking changes for her users before release, and either avoid them or provide patches to help her users adapt to the new behavior.
Because this is a repeated cycle, Alice can go one step further: Running her users’ test suites can become an official part of her library’s life cycle and stability guarantees. If Alice makes it easy for her users to provide her with their test suites, she can promise to run them before each release, and to investigate and take action on any failures. She need not commit to avoiding all breakages—she might choose to act on a failure by communicating with the downstream author or sending that author a pull request—but she will at least be informed of failures before release, and use appropriate judgment to choose how to respond.
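The mechanics don’t need to be elaborate to start with. The sketch below assumes a Python library whose opted-in users live in git repositories and test with pytest; the repository URLs are placeholders, and a production setup would need isolated environments, sandboxing, and proper reporting.

```python
#!/usr/bin/env python3
"""Pre-release check: run opted-in downstream test suites against a release candidate."""
import subprocess
import tempfile
from pathlib import Path

DOWNSTREAM_REPOS = [
    # Hypothetical users who have asked Alice to run their suites before each release.
    "https://example.com/acme/plotting-app.git",
    "https://example.com/widgets/chart-widgets.git",
]

def run_downstream_suite(repo_url, library_source):
    """Clone one downstream project, install the candidate library, and run its tests."""
    with tempfile.TemporaryDirectory() as workdir:
        checkout = Path(workdir) / "repo"
        subprocess.run(["git", "clone", "--depth=1", repo_url, str(checkout)], check=True)
        # Install the release candidate of Alice's library (simplified: no isolation).
        subprocess.run(["pip", "install", library_source], check=True)
        return subprocess.run(["pytest"], cwd=checkout).returncode == 0

def main():
    failures = [url for url in DOWNSTREAM_REPOS
                if not run_downstream_suite(url, library_source=".")]
    for url in failures:
        # A failure is a conversation starter, not automatically a blocker:
        # Alice might adjust her change, or send the downstream project a patch.
        print("needs investigation before release:", url)

if __name__ == "__main__":
    main()
```

However simple, a job like this turns “please file a bug if I break you” into a concrete, repeatable step of the release process.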
She’s effectively crowdsourcing additional test coverage, while also incentivizing her users to write good tests of the ways they make use of her library: If they write good tests, then the work of dealing with accidental breaking changes is largely pushed upstream to Alice, making their lives easier. Alice—because she cares about stability—welcomes this work, since it gives her a much faster feedback loop than triaging bug reports, and lets her move more quickly and with greater confidence.
Projects or organizations with monorepos have this architecture built in. This phenomenon can be tremendously accelerating for teams that write core tools and libraries: They can easily test a new version against all of their users, and even fix up users’ code that depends on an old behavior, all in a single changeset. Google, which famously maintains an enormous monorepo, refers to the social contract it has built around this setup as the “Beyoncé rule”: “If you liked it, you should have put a test on it.” In effect, this means that tooling teams will feel comfortable making changes and designing new versions as long as all of their users’ tests pass; if they accidentally break a user whose tests did not catch the change, the lesson is taken to be “write better tests for the broken behavior,” not “avoid making that change next time.” Granting teams this level of confidence to make improvements helps libraries and tools move more quickly and improve, to the ultimate benefit of the entire codebase.
This technique is not limited to monorepos, and the availability of cloud computing and CI services makes it increasingly feasible for distributed open-source development. Rust’s “crater” tool, which can automatically run the tests of every project in the language’s public package registry against a development version of the compiler or standard library, is another implementation of just this pattern.
Development as communication
Increasingly, the challenge of working on software is not just one of writing or understanding code yourself, but of figuring out how to coordinate your work with that of others. Much of the development of modern software engineering practices has been devoted to building tools and techniques to scale the coordination and communication aspects of building ever-larger and more complex software systems. Test suites and testing practices can be essential tools in this process.
As we design software projects, let’s hold these goals in mind as we implement test suites and test harnesses. Our test suites are living, breathing participants in the ongoing project of creating shared understanding among present and future developers and users of our systems. Let’s start treating them that way.