I’ve been around the block a few times when it comes to open source, dabbling and delving into projects of all shapes and sizes and seeking out opportunities to learn.
When I’m exploring a new codebase, I follow a framework I like to call “open-source archeology.” It’s a set of steps that allows me to understand the processes and concepts associated with a project before I start contributing. Using the codebase of the Docker CLI, a set of open-source tools for interacting with Docker over the command line, as a guiding example, let’s walk through these steps together.
Starting the dig
Before looking at any of the code in the repo, I’ll sift through a project’s open issues to get a sense of the common bugs and feature gaps. This step will also help identify the key maintainers and suss out how different concepts in the project intersect with each other. After skimming the issues, I’ll pick one to steer my exploration: Instead of trying to understand everything at once, I’ll select a few things to try to understand in the context of that particular issue.
There are two basic levels to understanding a piece of software: understanding enough to know how it works, and understanding enough to know how it can be modified or improved as well. In my experience, reading through a codebase with the intent of making a change is the best way to gain that deeper level of understanding.
Because the Docker CLI is a fairly well-scoped and mature project, the feature requests tend to address advanced workflows in Docker. I’ve chosen an open issue (at the time of writing) that adds support for reading a Docker config from multiple locations. Now, it’s time to get hands-on with the code.
Excavating the source
These days, instead of using git clone to grab a local copy of the codebase, I like to lean on GitHub’s command line application, which lets you grab a clone of a repository on your local machine with a single command:
$ gh repo clone docker/cli
I’ve long been a fan of the pen-and-paper approach to exploring codebases. Browsing through files, I’ll create lists and draw concept maps to decipher patterns within the code. Recently, I started taking digital notes and leveraging the CodeTour VS Code extension, which allows me to create annotated code walkthroughs. I can add descriptions to relevant directories and files, arrange them in a meaningful order, then walk back through them. It’s like creating a museum tour for code.
Some people prefer debugging as a way to understand codebases. This is a totally valid approach, but I find setting up a debuggable scenario is a shade more involved than annotating the codebase. Plus, with a lot of modern editor technologies, understanding the connections between different pieces of code is fairly effortless. You can delve pretty deep into a project without needing to involve a debugging scenario.
Deciphering the code
Now that we’ve got cloning squared away, we can talk about how to uncover the connections between different files. In the case of the Docker CLI, most of the actual CLI’s logic is implemented under a top-level CLI directory. It contains subdirectories that deal with particular concepts, like interfacing with the Docker registry or defining the functions associated with each command in the CLI. For this particular change, I’ll start by locating the configuration-related code paths in the codebase. Most of the logic is stored under the (aptly named) config directory. The logic associated with reading the configuration file into an in-memory format lives in one source file and in an entrypoint LoadDefaultConfigFile method, which is called when the CLI object is instantiated.
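As a rough sketch of that kind of search, the script below uses grep to find where a function is defined and which files mention it. To stay runnable anywhere, it builds a tiny scratch tree instead of touching a real clone. The directory layout, the file names (loader.go, cli.go), and the file contents are hypothetical stand-ins, not copies of the Docker CLI source. Only the LoadDefaultConfigFile name comes from the text above.

```shell
#!/bin/sh
# Scratch tree standing in for a real clone. The layout, file names, and
# contents are hypothetical; they are NOT taken from docker/cli.
set -eu
root=$(mktemp -d)
mkdir -p "$root/cli/config" "$root/cli/command"

cat > "$root/cli/config/loader.go" <<'EOF'
package config

// Stand-in for the entrypoint named in the article.
func LoadDefaultConfigFile() {}
EOF

cat > "$root/cli/command/cli.go" <<'EOF'
package command

func newExampleCli() {
	config.LoadDefaultConfigFile()
}
EOF

# 1. Where is the function defined? -r recurses, -n prints line numbers.
definition=$(grep -rn 'func LoadDefaultConfigFile' "$root/cli")

# 2. Which files mention it at all (definition and callers)? -l lists file names only.
mentions=$(grep -rl 'LoadDefaultConfigFile' "$root/cli" | sort)

printf 'defined at:\n%s\n\nmentioned in:\n%s\n' "$definition" "$mentions"

# Clean up the scratch tree.
rm -rf "$root"
```

In a real clone, the same two grep calls would be pointed at the repository's own directories. An editor's "go to definition" and "find references" features cover the same ground interactively.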
A lot of this exploration relies on identifying the places code is likely to live based on file names. It just goes to show how important naming is! Getting a sense of the code organization is one of the first steps for reading code—as with an essay outline, it provides a map to guide further exploration.
Once we have a basic understanding of the codebase, we can start filling in gaps and adding layers to our knowledge. I recommend examining the most public-facing APIs—in this case, the invocations a user would issue to the Docker CLI—and working backward from there. Alternatively, you can start by identifying the key types defined in a codebase and then fan out to determine where they’re instantiated or invoked.
There’s also a historical element to understanding a codebase, which brings us to another handy code exploration tool: git blame. It allows you to explore the changesets associated with a particular piece of code, and can help pinpoint when particular features were added. The inception of a feature is a key part of its story, helping to reveal why it was implemented in the first place. Reviewing the pull request associated with a change can also provide valuable context into how it came to be. I’ve learned a lot about how projects work and how to write good code by reading the reviews associated with a change.
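To make the history-digging concrete, here is a minimal sketch that runs git blame and two related git log queries against a throwaway repository. Everything in it is hypothetical: the config.txt file, its contents, the author identity, and the commit messages exist only so the commands have something to inspect. The git log -S search is an addition that goes beyond the git blame mentioned above. It helps when blame only shows the most recent rewrite of a line and you want the commit that first introduced a string.

```shell
#!/bin/sh
# Throwaway repository so the commands run anywhere. The file, author,
# and commit messages below are hypothetical.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# Pass an identity inline so commits work without any global git config.
commit() {
  git -c user.name='Example Dev' -c user.email='dev@example.com' commit -q -m "$1"
}

printf 'load config from home directory\n' > config.txt
git add config.txt
commit 'Add default config location'

printf 'load config from extra directory\n' >> config.txt
git add config.txt
commit 'Support an extra config location'

# Which commit last touched each line? -s hides the author and timestamp columns.
blame=$(git blame -s config.txt)

# Which commits changed the number of occurrences of "extra" in the file?
introduced=$(git log -S'extra' --format='%s' -- config.txt)

# Commit subjects for the file, oldest first.
history=$(git log --reverse --format='%s' -- config.txt)

printf 'blame:\n%s\n\nintroduced "extra":\n%s\n\nhistory:\n%s\n' \
  "$blame" "$introduced" "$history"

# Clean up the scratch repository.
cd / && rm -rf "$repo"
```

Once a commit hash turns up this way, git show with that hash displays the full change, and the commit message often points to the pull request where the discussion happened.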
Looking beyond the dig
You’ve now built a solid foundation for the next steps in your open-source journey: triaging issues, troubleshooting them with users, and making code changes. I list code changes last for a reason: An understanding of the concepts allows you to contribute to open source in so many more ways.
It’s easy to assume that contributing to open source is mostly about code, but I’d argue that it’s really about the ideas behind the code. There’s plenty to do with an open-source project before writing a single line. Exploring a codebase is a fascinating blend of archeology, history, and literature that sets you up to contribute to a project with deeper comprehension and confidence.