When you’re dealing with lots of text, you’ll eventually want to know more about what’s going on. What is it about? Which names, companies, and concepts are mentioned? In a sentence rich with keywords, what’s the actual subject, and who is doing what to whom?
Humans have amassed seemingly endless amounts of text over the centuries—and on the internet today, people produce so much of it that automating text analysis often becomes necessary. Maybe you want to group incoming customer emails into categories automatically, so that they can be answered more quickly. Maybe you want to analyze mentions of your company in news articles over time and find out whether the mentions are positive or negative. While computers don’t understand text the way humans do, we can now teach them some approximation of it to help us automate our work. I work on spaCy, an open-source library for natural language processing (NLP) in Python, which helps users do exactly that.
I started working on spaCy pretty much right after it was first released in 2015. From day one, spaCy was able to claim some compelling advantages over existing solutions. It was seven to eight times faster, more accurate, and featured a simple design that made it highly usable. However, for all its focus on ease of use, it took a long time for the documentation to catch up to the capabilities.
Documentation is still the number one flaw in most developer tools, both open and closed source. In addition to the usual challenges, like time, motivation, and the curse of knowledge, we also faced some particular difficulties in developing documentation that would address the needs of our users. Natural language processing is an interdisciplinary field, and developers come to spaCy with vastly different backgrounds, perspectives, and problems to solve; we have to create useful documentation without relying on the notion of a “typical spaCy user.” The library also relies heavily on statistical models, so the behavior of some of its functions isn’t entirely predictable; we can’t always document precisely what users should expect or why a function does what it does. Finally, the Python ecosystem is fragmented across multiple versions, packaging solutions, and operating systems, making it difficult to provide simple installation instructions.
The solution to most of these problems has involved making the documentation more dynamic. I’m particularly pleased with the most recent improvement: interactive code examples that can be run straight from the browser, making it much easier to try out the function being documented to see how it works.
One thing that’s always struck me about the spaCy community is the broad range of backgrounds and experiences our users come from. We can only really assume one thing about them: They want to do something with text. There are linguists who don’t care very much about the Python ecosystem. Experienced Python programmers who are just getting into machine learning. Machine learning engineers who haven’t thought much about linguistics before. Developers at large companies who spend a lot of their time thinking about scalability. Data scientists who mostly work in Jupyter Notebooks in their browsers.
I started out as a front-end developer with a linguistics degree. One of the first things I built after I joined the spaCy project was a visualizer for syntactic dependencies. It was a fun demo, but it was also a crucial tool for developers who hadn’t spent much time thinking about the grammar of natural languages. Over the years, we realized that to support a diverse audience, the documentation needs to come from a variety of different perspectives—say, that of a linguist, a programmer, a data scientist—and should ideally make use of different media and formats.
For some users, starting off with a 101 guide that explains the most basic concepts is incredibly valuable. Others prefer getting straight to the heart of things, seeing a script that shows an end-to-end example in context. And a single user may need different types of documentation at different times. At one point, they might need to zoom out and read an overview of how everything fits together, in order to figure out how to approach their problem. At another point, they might need specific information about some feature or function. So the documentation needs to provide quick references, too.
Most machine learning systems today rely on supervised learning: You provide labeled input and output pairs, and get a program that can perform analogous computation for new data. This allows for an approach to software engineering that Andrej Karpathy, a prominent AI researcher and director of AI at Tesla, has termed “Software 2.0”—programming by example data:
Our approach is to specify some goal on the behavior of a desirable program (e.g., “satisfy a dataset of input-output pairs of examples,” or, “win a game of Go”), write a rough skeleton of the code (e.g., a neural net architecture) that identifies a subset of program space to search, and use the computational resources at our disposal to search this space for a program that works.
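The paradigm in the quote above can be illustrated with a deliberately tiny sketch (the data and function are made up for illustration): instead of writing the program directly, we specify a dataset of input-output pairs, define a parameterized family of candidate programs, and search that space for parameters that satisfy the examples. Real systems search vastly larger spaces, such as the weights of a neural network.

```python
# The "specification": a dataset of input-output examples for y = 2x + 1
examples = [(0, 1), (1, 3), (2, 5), (3, 7)]

# The "rough skeleton": a parameterized family of programs
def program(w, b, x):
    return w * x + b

# Search the program space with gradient descent on the squared error
w, b = 0.0, 0.0
for _ in range(2000):
    grad_w = sum(2 * (program(w, b, x) - y) * x for x, y in examples)
    grad_b = sum(2 * (program(w, b, x) - y) for x, y in examples)
    w -= 0.01 * grad_w
    b -= 0.01 * grad_b

# The search recovers a "program" close to w=2, b=1
print(round(w, 2), round(b, 2))
```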
This programming paradigm also introduces new challenges for documentation: If you write a function, you can document its expected input and output. If something goes wrong, there’s usually an explanation, even if it’s just a bug in the software library itself. But what about statistical models?
Many developers who use spaCy start off by plugging in a pretrained model to make predictions on their text. This works well if the input text is similar to the data the model was trained on. But a model trained on newspaper text likely won’t perform very well on tweets. A model trained on data that was released five years ago has likely never seen the word “Tinder” (launched in 2012) and probably doesn’t know much about Snapchat (2011) either. However current their training data, pretrained models will always struggle to keep up with an ever-changing media landscape.
```python
import spacy

# load a statistical model for English
nlp = spacy.load('en_core_web_sm')

# process a string of text and create a document object
doc = nlp(u"The hysterical concern over how to pay for Bernie's plans is hilarious")

# iterate over the named entities predicted by the statistical model
for ent in doc.ents:
    print('Entity found:', ent.text, ent.label_)

# Entity found: Bernie ORG
```
A named entity is a “real-world object” that is assigned a name—for example, a person, a country, or an organization. spaCy’s statistical models can detect those names and predict their types based on context. In this case, the model predicts that “Bernie” in this context is most likely an organization (ORG). That’s clearly wrong, but replace “Bernie” with “Sony” and it suddenly makes more sense. Or with “Bezos,” which is correctly predicted as a person.
Machine learning is very experimental, and, to be honest, those of us working in it don’t always know exactly what we’re doing. Some approaches work as expected, many don’t, and a few succeed beyond anyone’s expectations. Sometimes we know why something performs as well or as poorly as it does; other times we have no idea. With that in mind, making the technology useful and accessible means documenting more than the model’s intended output when every prediction is correct: we also have to provide guidance on which types of text a model performs best on, how to design applications that mitigate the effects of errors, and how to improve a model’s accuracy for a specific use case.
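One common way to mitigate model errors in an application—sketched here in plain Python with hypothetical names and data, not spaCy’s API—is to overlay reliable domain knowledge, like a list of names you know to be people, on top of the model’s predictions:

```python
# Names we know are people in our (hypothetical) domain
KNOWN_PEOPLE = {"Bernie", "Bezos"}

def correct_label(text, predicted_label):
    """Override the model's label when we have reliable domain knowledge."""
    if text in KNOWN_PEOPLE:
        return "PERSON"
    return predicted_label

# Simulated model output: (entity text, predicted label)
predictions = [("Bernie", "ORG"), ("Sony", "ORG")]
corrected = [(text, correct_label(text, label)) for text, label in predictions]
print(corrected)  # [('Bernie', 'PERSON'), ('Sony', 'ORG')]
```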
For many users, the most important type of performance documentation we can provide is a set of repeatable benchmarks, so that our models can be compared directly against results from other researchers. This is something that many AI tools lack, especially commercial tools that have an interest in making their solutions seem like magical black boxes that never make any mistakes.
Measuring model accuracy
A model’s accuracy describes how often its predictions were correct under the given conditions. The “classic” evaluation of NLP algorithms is usually performed on the Penn Treebank corpus, consisting of Wall Street Journal articles from 1989. This is helpful because it allows researchers to benchmark their systems on the same data, and makes the results comparable. However, the corpus is very far removed from real-world use cases. To give our users a realistic perspective on how spaCy’s models perform, we provide the standard academic benchmarks, as well as a modern evaluation of the whole processing pipeline, from raw text to individual predictions.
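The standard metrics behind such benchmarks—precision, recall, and the F1 score—can be computed by comparing a model’s predicted entities against gold-standard annotations. Here’s an illustrative sketch with made-up data:

```python
# Gold-standard annotations vs. simulated model predictions,
# as (entity text, label) pairs
gold = {("Bernie", "PERSON"), ("Snapchat", "PRODUCT"), ("Tesla", "ORG")}
predicted = {("Bernie", "ORG"), ("Snapchat", "PRODUCT"), ("Tesla", "ORG")}

true_positives = len(gold & predicted)       # predictions that match the gold data
precision = true_positives / len(predicted)  # how many predictions were correct
recall = true_positives / len(gold)          # how many gold entities were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```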
To move past the black-box approach, we think it’s especially important to help developers customize the models to their own use cases. Providing documentation for training is difficult. If the training process isn’t producing good results, we don’t immediately know whether the problem is in the code the user executed, the settings they tried, or the examples they provided. Since training is experimental and iterative, users need to be able to reason about the choices they’re making and the results they’re getting back; it’s more about providing the right conceptual framework than troubleshooting each specific problem. That said, debugging is definitely part of the process. One thing that helped us a lot was to improve the errors and warnings system in spaCy, so that better information could be output during training. Across NLP and machine learning, the solution is often to increase the quantity or quality of the labeled examples used for training, which is why we developed our annotation tool, Prodigy.
Over the past few years, Python has established itself as the most popular language for machine learning and data science. It’s practical, productive, and, perhaps more importantly, it was in the right place at the right time. It may not be the fastest or the most convenient language, but developers appreciate its rich open-source ecosystem. Projects like Jupyter, which provides in-browser notebooks for developing and sharing code, as well as related open-source technologies for interactive computing, have also had a big impact on productivity, making it easier to spend less time worrying about setting up your development environment. (Writing a few lines of code and executing them in a friendly web-based notebook makes programming so much more approachable!)
Still, if you want to use a library, you need to install it, and this can be hard, no matter how experienced you are. We’ve always wanted to make spaCy available to as many developers as possible, and to support the most common configurations, operating systems, package managers, and Python versions. But cross-platform support comes with a price: We have to document a matrix of installation instructions and their often very subtle differences. Linear installation docs are really bad for this. If you’ve just discovered a cool library and can’t wait to try it, the last thing you want to do is read through paragraphs of instructions and caveats for platforms you don’t even care about.
To make it easier for users to find the right installation commands, I built a mini-library for adding an interactive widget to our documentation (inspired by PyTorch’s “quickstart” section).
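The idea behind such a widget can be sketched as a lookup from the user’s selections to the matching commands, so nobody has to read a linear page of caveats for platforms they don’t use. The commands below mirror spaCy’s documented pip/conda instructions, but the exact details vary by version and platform, so treat them as placeholders:

```python
def quickstart(package_manager="pip", venv=False, models=()):
    """Build a list of shell commands from the user's selections."""
    commands = []
    if venv:
        # set up and activate a virtual environment first
        commands += ["python -m venv .env", "source .env/bin/activate"]
    if package_manager == "pip":
        commands.append("pip install -U spacy")
    elif package_manager == "conda":
        commands.append("conda install -c conda-forge spacy")
    for model in models:
        # download the selected pretrained model packages
        commands.append(f"python -m spacy download {model}")
    return commands

for cmd in quickstart(package_manager="pip", venv=True, models=["en_core_web_sm"]):
    print(cmd)
```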
Of course, it’s never that easy. Like most performant Python libraries, spaCy is written in Cython, essentially a hybrid of Python and C/C++. This means that you either need to download a pre-compiled version, or compile the library locally on your system when you install it. If you’re lucky, a compiler is pre-installed on your system. If not, it might send you down a rabbit hole of C++ build tools and cryptic compiler errors.
Unfortunately, the installation process often becomes the first challenging experience users have to navigate with a new library, even before they are able to write their first line of code. This not only makes for a bad first impression, it also excludes developers who are new to the ecosystem or who don’t have much experience with package managers, compilers, or Python environments. We as library authors are working hard to make this less painful, but I’ve always felt like this isn’t enough. Wouldn’t it be much nicer if you could try things out before installing anything at all?
The more abstract the software and application, the harder it is to imagine how a usage example translates to your unique problem. Teaching a computer to recognize animals is fun, but how will this help me analyze legal documents? And what would this look like in German or French?
There are so many different things spaCy’s users might want to accomplish that we had to find a way to make our documentation flexible enough to suit their very varied needs. In my opinion, the best way to learn from examples and see if they work for you is to copy them, modify them, run them, and see what happens. Web technologies have a clear advantage here because, unlike Python, they run natively in the browser. Websites can actively use and run the code they document, which is great. Making a website execute Python, on the other hand, is difficult. In our case, we didn’t just need a Python interpreter—we also needed to pre-install spaCy and at least some of its model packages. It was also essential that every user have their own isolated environment so that they could start off with a clean slate, and so that we could safely eval whatever they typed in.
Jupyter Notebook: An open-source web application for creating and sharing documents containing live code in a variety of languages, including Python and R.
Jupyter kernel: A built-in program that runs and introspects the user’s code within a Jupyter environment, e.g., to execute Python code.
JupyterHub: A multi-user server for Jupyter Notebooks to start and manage multiple instances, e.g., for a group of students or data scientists in an organization.
BinderHub: An open-source tool to build Docker images from GitHub repositories (using repo2docker) and connect them to JupyterHub, e.g., to serve reproducible Jupyter Notebooks on demand.
Binder: A public, ready-to-use BinderHub deployment.
By the time I had tried out all of these options, I had become determined to make the interactive docs of our dreams happen. As a popular open-source environment for developing code in the browser, Jupyter seemed like the way to go, but I worried that I’d have to set up the whole container infrastructure myself. Luckily, other developers had solved this problem before me and built Binder and BinderHub. Binder hosts Jupyter Notebooks and makes them easier to reproduce. Based on a GitHub repository, it builds a Docker image of the required dependencies and starts up a new instance of it on demand. This means that every user gets their own isolated environment and can execute Python code against a Jupyter kernel. Typically, this all happens within a notebook—but it doesn’t have to. You can also connect to the kernel directly, send it a string of code you want to run, receive the output back, and render it on your website. This capability allowed us to make the majority of our code examples both executable and editable. We’d found a solution for building dynamic docs that adapt to our users and a very diverse array of needs.
```python
# execute code in your browser using Binder
import spacy

nlp = spacy.load('en_core_web_sm')  # load a spaCy model
doc = nlp(u"This is a sentence.")   # process some text
for token in doc:
    # print the word and its part-of-speech tag
    print(token.text, token.pos_)
```
An interactive code widget that connects to Binder, using my mini-library juniper.js.
The varied and interactive documentation we’ve managed to put together for spaCy wouldn’t have been possible without the work of countless other open-source developers. Thanks to modern web technologies, we’ve come a long way from the software manuals of yore. By seeking out and experimenting with different and better ways of producing documentation—identifying our users’ unique use cases and empathizing with their specific needs—we can deliver better open-source tools to more developers, and continue to compound improvements that contribute to the thriving free software ecosystem we have today.