Recently, Duolingo launched one of its most anticipated language courses: Arabic. The addition of Arabic—along with Hindi, Chinese, Spanish, and English—allowed Duolingo to achieve one of its long-term goals: providing free courses for all of the world’s top five spoken languages.
Spoken by over 400 million people worldwide, Arabic has long been on our roadmap, but it came with a unique set of challenges. We saved it for last out of the “Big Five” so that we could incrementally tackle new challenges with each new launch. Our Chinese course launch, for example, forced us to create ways to teach a writing system that’s very different from English and other Latin-script languages. From the Hindi launch, we learned how to teach and test a language written in its own script, one whose text is not always clearly delimited by white space.
While working on Arabic introduced the new challenge of supporting a right-to-left language, it also gave us a chance to test the improvements we had made to our course deployment process since the Hindi launch. Hindi was the first of the Big Five for which we used a new content creation process. It allowed our team to create the course much faster, but it also meant we needed to test every aspect, from content to functionality, that much more thoroughly.
Limitations of automated testing: Handling nondeterminism
Learning a language with Duolingo is very different from learning from a textbook or a set curriculum. Proofreading a block of curriculum text would be fairly straightforward—but Duolingo is a dynamic app. Through our main engine, which we call Session Generator, we generate a different series of exercises depending on what we know about a user’s learning needs, settings, platform, and so on. In other words, we adapt to the learner.
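To make that concrete, here is a minimal sketch of what adapting a session to a learner can look like. The exercise types, fields, and thresholds below are placeholders for illustration, not the actual Session Generator logic:

```python
from dataclasses import dataclass
import random

@dataclass
class LearnerState:
    skill_strength: float   # 0.0 (brand new) to 1.0 (mastered)
    platform: str           # "web", "android", or "ios"
    listening_enabled: bool  # user setting

def generate_session(state: LearnerState, size: int = 10) -> list[str]:
    """Pick exercise types suited to this learner's current state."""
    pool = ["translate", "match_pairs", "select_image"]
    if state.listening_enabled:
        pool.append("listen_and_type")
    if state.skill_strength < 0.3:
        # Beginners see more recognition-style exercises.
        weights = [1, 3, 3] + ([1] if state.listening_enabled else [])
    else:
        # Stronger learners get more production-style exercises.
        weights = [3, 1, 1] + ([2] if state.listening_enabled else [])
    return random.choices(pool, weights=weights, k=size)
```

Two learners working on the same skill can therefore see quite different sessions, which is exactly what makes exhaustive proofreading of “the course” impossible.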
Duolingo is the most downloaded education app in the world, and we attribute much of this success to our culture of A/B testing everything. From large features to a button color change, every modification to the product is an experiment meticulously run on a small portion of users to understand the effect on retention and teaching effectiveness. We’ve run over 2,000 experiments, and at any given moment there may be more than 100 A/B experiments running on the platform. Assuming there are only two conditions per experiment and 100 running experiments, that’s 2^100 possible experiences a given user might have. Although A/B testing is useful for continuous improvement and accountability, it’s a nightmare for automated testing coverage; it’s just not feasible to test every possible state of the app through automation.
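As a back-of-the-envelope check on that number (the function below is just illustrative arithmetic, not part of our codebase):

```python
# With c conditions per experiment and n concurrent experiments, a user's
# combination of assignments can take c ** n distinct values.
def possible_states(conditions_per_experiment: int, experiments: int) -> int:
    return conditions_per_experiment ** experiments

print(possible_states(2, 100))  # 1267650600228229401496703205376, roughly 1.3e30
```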
Because of the dynamic nature of the Duolingo experience, we have to take a more flexible approach to testing course launches. We have automated tests to cover the generally unchanging critical path, and we utilize our QA team to test all of the conditions of the largest experiments. For many smaller experiments, we concede that bugs may happen, but we also give ourselves a way to detect and fix them quickly. Our approach is to anticipate what we can, ensure good monitoring, and offer prompt fixes.
Preventing issues before they happen
Step one of a smooth launch starts well beforehand. The Duolingo Language Incubator, where volunteers, contractors, and staff members input content data, is equipped with a ton of data validation logic that accounts for even the smallest details. For example, we don’t allow saving a translation of a sentence that has mismatched punctuation. We don’t save a multiple-choice exercise if not enough distractor options (i.e., incorrect responses) are provided. And we add a small layer of gamification and metrics to the interface to incentivize completion and thoroughness, ensuring that we have good coverage of all of the content we wish to teach in a given course. By the time a course is ready to launch, a lot of validation will have already happened in real time as data is entered.
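As a rough illustration of what these checks look like (the punctuation set and the minimum distractor count below are assumptions made for the example, not the Incubator’s actual rules):

```python
MIN_DISTRACTORS = 3  # illustrative threshold, not the Incubator's actual rule
TERMINAL_PUNCTUATION = {".", "!", "?", "؟", "।"}

def punctuation_matches(source: str, translation: str) -> bool:
    """Reject a translation whose ending punctuation doesn't match the source's."""
    src_end = source.strip()[-1:]
    trg_end = translation.strip()[-1:]
    return (src_end in TERMINAL_PUNCTUATION) == (trg_end in TERMINAL_PUNCTUATION)

def can_save_multiple_choice(correct: str, distractors: list[str]) -> bool:
    """Require enough distinct incorrect options before the exercise can be saved."""
    unique = {d for d in distractors if d and d != correct}
    return len(unique) >= MIN_DISTRACTORS
```

Because checks like these run as contributors type, most malformed content never makes it into the course data in the first place.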
Testbed: Production disguised as a staging server
We have a unique setup when it comes to staging a course: We don’t. Every time content for a course is changed in the Incubator, it syncs with production’s caching layers, which are used to serve content to our learners. The ground truth data always lives in the Incubator, but it’s constantly synced with production, regardless of whether the course is live. When it’s time, launching a course is simply a matter of listing the course as available, with no further data syncing required.
A server called Testbed runs production code and has access to production resources. Testbed is accessible from the Incubator only through an authenticated route, and the only difference between it and production is a local variable that tells the server to enable pre-launched courses. This means we can try the courses as they’re being built, well before they’re ready for launch. This has helped us detect issues early so that we can fix them while content is still being created.
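Conceptually, the gate is as simple as the sketch below. The variable name and course fields are hypothetical, but the point is that Testbed runs the same code path as production with one switch flipped:

```python
import os

# Hypothetical switch: set on Testbed, unset on regular production servers.
SHOW_PRELAUNCH_COURSES = os.environ.get("SHOW_PRELAUNCH_COURSES") == "1"

def available_courses(all_courses: list[dict]) -> list[dict]:
    """Return launched courses, plus pre-launched ones when the Testbed switch is on."""
    return [
        course for course in all_courses
        if course["launched"] or SHOW_PRELAUNCH_COURSES
    ]
```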
We take this approach to ensure that what we test is exactly what users will experience. In the past, we’ve run into issues where staging servers didn’t behave identically to production—maybe some staging-specific feature flag was toggled, debug tools were enabled, caches weren’t enabled, and so on. Testbed has been a lifesaver in terms of identifying issues before our users see them.
Dogfooding and alpha testing through an A/B testing framework
After our contributors and/or staff lightly test a course on Testbed, we open it up to a wider audience. We believe strongly in “dogfooding”—that is, trying our own product. Since our A/B testing framework is already equipped to assign a condition based on user ID (with the ability to manually set a condition for a user), we sometimes leverage it to preemptively release features to a specific group of users. For a course launch, we gate the ability to start the course on an experiment condition. We can then put the entire company—as well as any eager alpha testers we recruit from our volunteer community or the Duolingo forums—into the condition.
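A hash-based assignment like the sketch below is one common way to implement this kind of deterministic bucketing with manual overrides; the experiment and condition names here are made up for illustration rather than taken from our framework:

```python
import hashlib

# Manually set conditions, e.g. for employees and alpha testers:
manual_overrides: dict[tuple[int, str], str] = {}  # (user_id, experiment) -> condition

def condition_for(user_id: int, experiment: str, conditions: list[str]) -> str:
    """Deterministically bucket a user by ID, unless a condition was set manually."""
    override = manual_overrides.get((user_id, experiment))
    if override is not None:
        return override
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return conditions[int(digest, 16) % len(conditions)]

def can_start_course(user_id: int) -> bool:
    """Gate a pre-launch course on an experiment condition."""
    return condition_for(user_id, "arabic_prelaunch", ["control", "unlocked"]) == "unlocked"
```

Because the assignment is a pure function of the user ID and experiment name, a tester gets the same condition on every platform and every request, without any client-side changes.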
When we were preparing to launch Arabic, the dogfooding process involved ensuring that the text-to-speech feature functioned properly (we used Amazon Polly’s new Arabic TTS); that the image-based challenges appeared in the course; that hints appeared in each lesson; and much more.
One large benefit of this approach to alpha testing is that it’s done entirely on the server, so there’s no need to manage test builds for Android and iOS. All testers can use their own phones and the normal Duolingo app or website, and updates and bug fixes are immediately propagated.
We catch many corner cases and small bugs at this stage. Very few would block launch, but they do go into our bug-triage process for prioritization. We also get a deeper sense of how the course feels holistically, since this is when people actually try to start learning the language. The input of beginner learners is invaluable, since up to this stage, the course has been tested mostly by people fluent in the language.
Using web for soft launch
After addressing any pressing issues that arise during dogfooding and alpha testing, we’re finally ready to launch the course to the general public. We always do a soft launch first, quietly making the course available so that only a few users find and start it. (Soft launch lasts about a day if we anticipate press around a course launch; it may be longer for a smaller course.) This conservative rollout allows us to monitor things like how well our resources scale to accommodate the course, and is essentially an extension of alpha testing. We always soft launch on our web platform because:
Our forums, which are a great source of bug reports and feedback, are on our website.
It’s generally faster and easier to revert or make changes to the website than to our mobile apps.
Our web platform does not support offline content. Our mobile apps, in contrast, download and store sessions to provide a smoother experience and allow users to do things offline. If something is wrong with the course, we don’t want the apps to have stored bad content locally, since that would be a hassle to clean up later.
Launching on mobile platforms using preemptive feature flags
The final step is to release the course on our mobile apps. We have weekly builds for both Android and iOS; in anticipation of a course launch, we add support for a course several weeks in advance so that users have time to update their app versions before the official launch. The ability to start the course is gated by a feature flag, which we then simply switch on to launch on all platforms.
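In effect, the apps ship with course support early and ask the server whether to show it. The sketch below, with an invented flag name and payload shape, shows the shape of that gate:

```python
# Server-side flag store (hypothetical names); the mobile apps ship with support
# for the course weeks early and simply ask the server whether the flag is on.
feature_flags = {"arabic_course_enabled": False}

def client_config() -> dict:
    """Config payload the apps fetch on startup/sync."""
    return {
        "courses": {
            "ar_en": {"enabled": feature_flags["arabic_course_enabled"]},
        }
    }

# Launch day: flip the flag once and every up-to-date client sees the course.
feature_flags["arabic_course_enabled"] = True
```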
Postlaunch: Leveraging the crowd for continuous improvement
A course launch is just the beginning. Though we celebrated the launch of Arabic (and all of our courses) with enthusiasm, every course on Duolingo is constantly evolving and improving: Learners can report bad translations or sentences, poor audio quality, or any other errors they may find. But learners don’t need to intentionally report issues to help us improve—we’re constantly tracking session failures, challenge success rates, and other metrics to predict whether a sentence is missing a valid translation. We monitor our reporting system, social media, and forums for bug reports. The power of the crowd allows us to surface the most commonly reported data issues and fix them quickly in the Incubator without the need for code changes.
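The thresholds and signals in the sketch below are illustrative rather than our production heuristics, but they show how report counts and challenge success rates can be combined to surface sentences that likely need a fix in the Incubator:

```python
from collections import Counter

def sentences_to_review(
    report_counts: Counter,          # sentence_id -> number of learner reports
    success_rates: dict[str, float], # sentence_id -> fraction of correct answers
    min_reports: int = 25,           # illustrative threshold
    max_success: float = 0.5,        # illustrative threshold
) -> list[str]:
    """Surface sentences that are heavily reported or unusually hard to answer,
    which often means an accepted translation is missing."""
    flagged = {sid for sid, n in report_counts.items() if n >= min_reports}
    flagged |= {sid for sid, rate in success_rates.items() if rate <= max_success}
    return sorted(flagged, key=lambda sid: (-report_counts[sid], success_rates.get(sid, 1.0)))
```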