In praise of property-based testing

Example-based tests hinge on a single scenario. Property-based tests get to the root of software behavior across multiple parameters.

Property-based testing is a style of testing that originated with the Haskell library QuickCheck. I’ve been working on bringing it into the mainstream since early 2015, when I released the Python library Hypothesis, which has since seen fairly widespread adoption. I’d like to tell you a bit about what property-based testing is and why it matters.

Traditional, or example-based, testing specifies the behavior of your software by writing examples of it—each test sets up a single concrete scenario and asserts how the software should behave in that scenario. Property-based tests take these concrete scenarios and generalize them by focusing on which features of the scenario are essential and which are allowed to vary. This results in cleaner tests that better specify the software’s behavior—and that better uncover bugs missed by traditional testing.

What’s wrong with examples?

The problem with example-based tests is that they end up making far stronger claims than they are actually able to demonstrate. Property-based tests improve on that by expressing exactly the circumstances in which our tests should be expected to pass. Example-based tests use a concrete scenario to suggest a general claim about the system’s behavior, while property-based tests focus directly on that general claim. Property-based testing libraries, meanwhile, provide the tools we need to test those general claims.

An example-based example

To see how this shift in focus works, take a look at a fairly typical example-based test. Suppose we’re testing a web application that allows users to collaborate on projects. Projects have a maximum number of collaborators, and we want to be able to add users up to that limit. To validate that we can do this, we write the following test:

from .models import User, Project
from django.test import TestCase

class TestProjectManagement(TestCase):
    def test_can_add_users_up_to_collaborator_limit(self):
        project = Project.objects.create(
            collaborator_limit=3,
            name="Some project"
        )
        alex = User.objects.create(email="alex@example.com")
        kim = User.objects.create(email="kim@example.com")
        pat = User.objects.create(email="pat@example.com")
        project.add_user(alex)
        project.add_user(kim)
        project.add_user(pat)
        self.assertTrue(project.team_contains(alex))
        self.assertTrue(project.team_contains(kim))
        self.assertTrue(project.team_contains(pat))

In order to test our general claim (that we can add users up to the collaborator limit of the project), we have written a test of a specific instance of that general claim. This isn’t an unreasonable thing to do: If this test fails, our claim is surely false. The problem is that if our test passes, it doesn’t tell us much about the claim itself—it just tells us that the test passes. The test name doesn’t reflect this: “test_can_add_users_up_to_collaborator_limit” sure sounds like a general claim. A more accurate name would be “test_can_add_three_users_at_the_same_domain_to_a_project_with_a_collaborator_limit_of_3”—but this is unlikely to be a particularly popular naming convention.

This is the fundamental problem of example-based testing: We often treat our tests as specifications, but in reality they are stories. Worse, they’re often shaggy-dog stories, full of a mess of random details, and we get no clue as to which parts of the test actually matter and which parts are just a distraction.

Take a look at the test above. Which of the details matter? Presumably—hopefully!—the project name is irrelevant. Does it matter that the collaborator limit is specifically three? Probably not, but it might matter that it’s greater than one. Does it matter that the users all have an email address at the same domain? Maybe, but the test doesn’t say.

A property-based example

Property-based testing is about removing those extraneous details, and property-based testing libraries are tools to help us do so.

The following is how we might write the same test using Hypothesis:

from .models import User, Project
from hypothesis.extra.django import TestCase
from hypothesis import given
from hypothesis.extra.django.models import models
from hypothesis.strategies import text, lists

class TestProjectManagement(TestCase):
    @given(
        text(),
        lists(models(User), unique_by=lambda u: u.email)
    )
    def test_can_add_users_up_to_collaborator_limit(
        self, project_name, collaborators
    ):
        project = Project.objects.create(
            name=project_name,
            collaborator_limit=len(collaborators)
        )
        for c in collaborators:
            project.add_user(c)
        for c in collaborators:
            self.assertTrue(project.team_contains(c))

This is the same test, but the details we were unsure of are now allowed to vary: Instead of a fixed project name or set of users, we have said that this works for any project name and any list of distinct users. The test now captures exactly our original intent, because we’ve abstracted away the details that are unimportant by allowing them to vary.

The way this works is that Hypothesis provides a @given decorator that lets us specify a range of valid inputs to a test. These inputs are specified using strategies, which describe the range of valid values for each argument to take.

In this case, our project name may be any string, and our collaborators may be any list of user model objects, as long as they all have distinct email addresses. A test written in this way is expected to pass for any possible argument allowed by its strategies.
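To make the idea of strategies a little more concrete, here is a tiny standalone sketch of my own (not from the test above). It combines two built-in strategies to describe “any non-empty list of distinct positive integers” and states a property that should hold for every value the strategy can produce:

from hypothesis import given
from hypothesis import strategies as st

# A strategy describes a whole range of values, not a single example:
# here, any list of at least one distinct integer greater than zero.
team_sizes = st.lists(st.integers(min_value=1), min_size=1, unique=True)

@given(team_sizes)
def test_every_generated_list_is_non_empty(xs):
    # Expected to pass for every list the strategy can generate.
    assert len(xs) >= 1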

What happens when we run this new version of the test? Well, it passes. It would require a fairly obtuse implementation to pass the original test and fail this one. Here, the improvement over the original test is purely in terms of making it a better-factored test, but that’s already a big increase in the quality of our test suite.

Making assumptions explicit

It’s common for people to use fixtures and factory libraries in order to reduce the tedium of setting up their data over and over again. This causes the tests to depend (in subtle and unintentional ways) on the details of the fixture data, and to become increasingly brittle as a result. Property-based testing avoids that brittleness by insisting the details that shouldn’t matter are allowed to vary, making it impossible for tests to depend on them. The result is a significantly cleaner and more robust test suite, which makes fewer implicit assumptions about fixture data.

It goes further than this! By forcing us to precisely describe the behavior of our software, property-based testing in turn forces us to make explicit not just the assumptions that we made when writing the tests, but also the assumptions that we made when writing the software. Often we will discover that those assumptions are wrong.

Let’s take a look at two examples of this sort of wrong assumption, taken from real-world (but old) bugs found by Hypothesis.

What inputs are valid?

We wrote some tests for Python bindings to Argon2, an award-winning password hashing library. The test itself is fairly straightforward (take a password, hash it, verify the original password against the hash), but the interesting feature is that Argon2 takes a great many configuration options to control difficulty, and by allowing those to vary we exposed a (fairly harmless) bug in the underlying implementation:

from argon2 import PasswordHasher
from hypothesis import given, assume
import hypothesis.strategies as st

class TestPasswordHasherWithHypothesis(object):
    @given(
        password=st.text(),
        time_cost=st.integers(1, 10),
        parallelism=st.integers(1, 10),
        memory_cost=st.integers(8, 2048),
        hash_len=st.integers(12, 1000),
        salt_len=st.integers(8, 1000)
    )
    def test_a_password_verifies(
        self, password, time_cost, parallelism, memory_cost, hash_len, salt_len
    ):
        # Reject examples with a memory cost of less than 8 per thread as invalid
        # (these would raise an error when constructing the password hasher).
        assume(parallelism * 8 <= memory_cost)
        ph = PasswordHasher(
            time_cost=time_cost, parallelism=parallelism,
            memory_cost=memory_cost,
            hash_len=hash_len, salt_len=salt_len
        )
        hash = ph.hash(password)
        assert ph.verify(hash, password)

Here the test does fail. When this happens, Hypothesis prints the specific combination of arguments that trigger that failure:

Falsifying example: test_a_password_verifies(
password='', time_cost=1, parallelism=1, memory_cost=8, hash_len=513, salt_len=8
)

The bug is that if the hash length is greater than 512, it hits an internal fixed-size buffer in the underlying C library, which causes verification to go wrong. The hashing appears to succeed, but the resulting hash will not verify the original password.

How did we get this output? This is where the property-based testing library, in this case Hypothesis, comes in. When we run a test in Hypothesis, here’s what happens:

  1. Hypothesis checks its cache to see if it has a previous failing example.

  2. If not, it attempts to generate a failing example.

  3. If it has no failing example, it passes the test.

  4. If it has a failing example, it applies shrinking to attempt to find a simpler set of parameters triggering the same bug.

  5. Hypothesis reports the simplified parameter values to the end user.

Shrinking, in particular, is one of the big benefits of using property-based testing libraries over fixture libraries with random generation. Debugging randomly generated values puts us back in the situation of having to ask what details matter. And randomly generated values are often worse than if a human had written them, because they are large and messy—which makes it hard to pick out what matters.

In contrast, consider the falsifying example above: All of the parameters are the smallest value they can be—except for hash_len, which is instead the smallest value it can be while still triggering the bug. If we replaced 513 with 512, the bug would go away. When we work with these shrunk examples, the details that are responsible for the failure tend to stand out.
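To see shrinking in action outside the Argon2 example, here is a deliberately broken toy property of my own (not from the article). The stated property is false, so Hypothesis will find a counterexample and then shrink it; in practice it tends to report something minimal, such as a single-element list containing -1, rather than whatever large, messy list it first stumbled on.

from hypothesis import given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_sums_are_never_negative(xs):
    # A deliberately false property: any list whose sum is negative fails it.
    assert sum(xs) >= 0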

This is the first type of assumption that property-based testing helps uncover: assumptions about the sorts of inputs our functions will be called with. Property-based tests require us to be explicit about the valid range of inputs, which helps us find out what that range actually is, rather than just testing the happy path.

Uncovering differing assumptions

Another common source of wrong assumptions is when parts of the software are written by different people—either because we’re using third-party libraries or just because there are multiple people on the team. When assumptions are implicit, it can be hard to notice when different people make different ones.

The following is a fun example of this sort of mismatch. This comes from a library called BinaryOrNot, whose job is to heuristically detect if a file was meant to be a binary or a text file:

from binaryornot.helpers import is_binary_string
from hypothesis import given
from hypothesis.strategies import binary

@given(binary())
def test_never_crashes(s):
    is_binary_string(s)

Here we have generated a byte string to test and passed it to the is_binary_string function. Note that we haven’t even checked whether it does anything sensible! We’re simply checking that the function doesn’t raise an error. This is often a very useful test to write with property-based testing, as it’s easy to implement and often flushes out a surprising number of errors. Using assertions or contracts, we can even improve it further by making sure that our code crashes in cases where it might otherwise have silently failed.
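For instance, assuming is_binary_string is meant to return a plain True or False (an assumption on my part, not something stated in the article), we could strengthen the test slightly by also asserting something about the result:

from binaryornot.helpers import is_binary_string
from hypothesis import given
from hypothesis.strategies import binary

@given(binary())
def test_returns_a_boolean_and_never_crashes(s):
    # As before, any byte string is a valid input; now we also check a
    # minimal contract on the result rather than only "no exception raised".
    result = is_binary_string(s)
    assert isinstance(result, bool)

Even without that extra check, though, the plain never-crashes version was enough to find a real bug.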

In this case, the test failed with the input s=b'\xae\xc5\xdc', caused by a Unicode decoding error. Why? Well, because it depended on another library named Chardet. Chardet does heuristic prediction of the intended encoding of text files. In this case, the logic was that BinaryOrNot thought this string might be text, asked Chardet to predict its encoding, and was told that it was a specific encoding with 100 percent confidence. Unfortunately, it turns out that when Chardet says this, there is no intended implication that the string is actually a valid sequence of bytes for that encoding.

This is documented behavior, and it has a reasonable enough motivation, but it’s also surprising enough that the BinaryOrNot authors had (equally reasonably) never considered the possibility. There was a mismatch between the contract they believed Chardet followed and the one it actually followed, and this mismatch did not surface until their code was tested with property-based testing.

More powerful property-based tests

We now get to where most property-based testing articles start: the sorts of tests that only really make sense to write when they’re property-based. Because property-based testing makes it easy to write tests that run over a wide range of parameters, it prompts us to think about what claims we can make about our programs that are always true. These are the “properties” in “property-based testing.”

It can be quite hard to think of these properties, so I generally recommend that people don’t worry too much about them until they’ve gotten familiar with the basics of property-based testing and integrated it into their normal workflow. However, there’s one that’s easy and ubiquitous enough that it’s worth knowing about from the start: the encode/decode, or round-tripping, test.

When we have some data that we convert to a serialized representation, we can always check that serializing it and then deserializing it gives us back what we started with. This is useful for two reasons: First, essentially every nontrivial application serializes its state somewhere—into a database, into files, into an API. Second, such serialization usually has bugs, and as a result important information gets lost or corrupted when we transform from one format to another.

Here’s an example showing one of my favorite bugs found by Hypothesis:

from dateutil.parser import parse
from hypothesis import given
from hypothesis.strategies import datetimes

@given(datetimes())
def test_can_parse_iso_format(dt):
    formatted = dt.isoformat()
    assert formatted == parse(formatted).isoformat()

This tests the dateutil library’s ability to parse times in the ISO 8601 format (the one true date format!) as follows:

  1. Hypothesis takes an arbitrary datetime (a combined date and time object).

  2. It converts the datetime to ISO 8601.

  3. It parses that back to a date.

  4. Then Hypothesis checks that this date agrees with the original. (We check that they have the same ISO 8601 format rather than direct equality due to the intricacies of equality on time zone objects.)

This function fails when given 0005-01-01T00:00:05, which it erroneously parses as 0001-05-01T00:00:05, swapping the year and the month.

This bug is extremely specific. It only occurs when the year is equal to the second, and it happens because of some ambiguity in how the parser interprets the date. It is very unlikely that it would ever have been found by a normal, human testing process, but chances are that some users would eventually have run into it. Hopefully they would have noticed, rather than having their data be silently corrupted. By writing property-based tests that assert the consistency of the data when converting it between different formats, we eliminate whole classes of subtle bugs from production.
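The same pattern transfers directly to our own serialization code. As a sketch, here is a round-trip property for the standard library’s json module (my own illustration, not from the article), restricted to data that JSON can represent faithfully:

import json

from hypothesis import given
from hypothesis import strategies as st

# Strings, booleans, None, and integers, inside dictionaries with string keys.
# Floats are deliberately left out to sidestep NaN, which never compares equal to itself.
json_values = st.one_of(st.none(), st.booleans(), st.integers(), st.text())
json_documents = st.dictionaries(st.text(), json_values)

@given(json_documents)
def test_json_round_trips(document):
    # Serializing and then deserializing should give back exactly what we started with.
    assert json.loads(json.dumps(document)) == document

The same shape of test works for any encode/decode pair you own, whether it writes to a database row, a config file, or an API payload.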

Where to now?

To recap, adopting property-based testing will:

  • Bridge the gap between what we claim to be testing and what we’re actually testing.

  • Reveal the assumptions that we made during testing and development, and check if they are violated.

  • Expose subtle inconsistencies in our code that would be hard to detect with example-based testing.

But how do you get there if you’re starting from an example-based test suite? The most important thing is to just start. Most property-based testing libraries are designed to integrate easily with common testing frameworks, so it’s fairly easy to add one to your existing continuous integration. You can start small, adapting just a couple of existing example-based tests into simple properties, using them as a gateway to get yourself over the initial hurdle of using property-based testing.

Once you’ve got a couple of property-based tests in your test suite, you can add new property-based tests in the course of normal development. “Can this test be a property-based test?” is a good question to ask during code review. And touching existing code is a good opportunity to generalize the tests before writing new features.

Alternatively, if you want to get started by jumping in at the deep end, it can be fun to take a day out and get the whole team together to work on adding property-based tests. This will go a lot further a lot faster, but be warned: You will probably find an awful lot of bugs!

About the author

David MacIver is a researcher and software developer, best known for Hypothesis, a property-based testing library for Python. After a decade in industry, he’s currently working on a PhD based on research that he completed in the course of writing Hypothesis.

@DRMacIver

Artwork by

Sean Suchara

seansuchara.info
