When something goes wrong with a software application, service, or system, someone needs to be responsible for figuring out what went wrong and fixing it. In the tech industry, this set of tasks is usually referred to as “being on-call” for that software. Much like doctors on-call at a hospital, a set of engineers is placed on an on-call rotation, meaning that they share the on-call responsibility with a team and everyone on the team takes turns on the rotation. During their on-call shifts, they are paged any time something breaks (usually via an automated push notification on their smartphone, a text, or a call), and they are responsible for quickly responding to the page, fixing what broke, and making sure that the same problem never happens again. On-call engineers are the “first responders” of software engineering.
Historically, the responsibilities required to run large software applications and systems have been divvied up between two kinds of teams: so-called “development” teams, who are responsible for all tasks associated with building and adding new features to applications and systems, and so-called “operational” teams, who are responsible for running and maintaining them. On-call responsibilities have been viewed for a long time as being part of the operational workload, and developers have rarely been on-call for the software they build.
In the past several years, that changed. It’s difficult to pinpoint exactly when the industry changed its mind about on-call responsibilities, but the “who”, the “where”, and the “why” are relatively straightforward to uncover and understand. To determine the state of the industry, Increment spoke with over thirty industry leaders about the “who” and the “why”. Here is what we learned from our conversations about the industry-wide movement to put developers on-call for their software.
The majority of the companies we surveyed used to divide engineering tasks between their technical teams in the old way: their development teams wrote the code (and sometimes did the testing), and then threw the debugging, the testing, the running, and the maintenance of the code over to an operational team. Over the past few years, most of these companies discovered that this approach to running software simply didn’t scale, and that developers felt a lack of ownership when they weren’t on-call for the code they wrote—most importantly, this lack of ownership translated into unreliable systems being built and run. To fix these scalability and reliability problems, they moved the operational workload onto the development teams, who quickly (though not painlessly) learned to build better, more resilient systems.
Google was famously one of the first companies in the tech world to realize that the old way of doing operational work wouldn’t and couldn’t scale at the level their systems required, so they created a new role for “Site Reliability Engineers” (SREs). These new SREs approached the operational tasks with a software engineering mindset: they automated away all of the operational grunt-work, and made the systems run more reliably. Nowadays, SREs at Google run, maintain, and are on-call only for the most important and stable services (like Ads, Gmail, and Search), while development teams carry the operational workload for other non-stable, non-critical services (which aren’t staffed by SREs). The SRE approach to operations is now credited with the success of Google’s systems, a success that much of the industry has tried to emulate by adopting the SRE role and practices. However, many industry adopters have taken the SRE title without also adopting the SRE mindset or Google’s requirement that SREs only run and maintain stable systems: Google requires development teams to run their own services if those systems aren’t stable.
Spotify was one of the companies that adopted the SRE role early on, and treated SREs as typical operations engineers. In Spotify’s early days, their small SRE team was responsible for all operational work, including being on-call for all Spotify systems. As the company grew, and the operational workload grew alongside it, Spotify’s leadership discovered that they couldn’t hire SREs quickly enough to meet the operational demands. The only scalable solution they found was moving the on-call responsibilities to the development teams.
Airbnb discovered that having a separate operations team “creates a divide and simply doesn’t scale,” says Airbnb SRE manager Joey Parsons, and “it puts the onus of responsibility for fixing an issue on the wrong team.” Airbnb decided to put developers on-call for their systems, taking the stance that if developers can deploy to production whenever they want, then they should be the ones fixing problems caused by their services and deployments. Though Airbnb has SREs that work closely with development teams, their SREs focus only on improving reliability across systems, and they are the only team that is not on-call for any of Airbnb’s services. Many other companies, like Pinterest and New Relic, have followed a similar approach to that of Airbnb: developers are on-call for their services, but have SREs working alongside them (usually “embedded” within the team) to make sure that the development teams are following industry best practices for on-call and general service reliability.
Some companies, like Datadog, DigitalOcean, and Dropbox, have focused on taking a shared, holistic approach to on-call responsibilities, and have put both development and operations teams on-call for services together. At Datadog, engineering leadership was determined to avoid an ops/dev split from the very start, and so they ensured that operational tasks were distributed between ops and dev teams. Importantly, SREs and developers at Datadog share the on-call rotations, ensuring that every on-call shift is staffed by both experts in the code (the developers) and experts in reliability (the SREs). Dropbox takes a similar approach, viewing on-call responsibility as something that both development and SRE teams need to own. DigitalOcean has both development teams and operational teams on-call, but with a twist: development teams are on-call for their services, while operations teams are on-call for the interactions between the services.
PagerDuty, on the other hand, has what engineering manager Sweta Ackerman refers to as a “you build it you own it” and “end-to-end ownership” model: SREs are on-call and responsible for low-level infrastructure (like hardware, middleware, communication, databases, etc), while developers are on-call and responsible for everything on top of that infrastructure (including development, deployment, monitoring, and the hardware they run their services on). Ackerman says that PagerDuty had to switch to the shared-responsibility model two years ago, in an effort to ship features more quickly, encourage teams to “control their own destinies,” and to “reduce [inter-team] dependencies”—a model that the company has found wildly successful.
Amazon is famous (or, rather, infamous) for practically doing away with the operational role altogether, and was one of the first industry leaders to do so. Throughout all engineering organizations at Amazon (including AWS), developers are responsible for all development and operational tasks associated with their services. Putting the onus on developers to run, maintain, and be on-call for their services is part of Amazon’s cultural emphasis on “ownership”: you don’t “own” the code you write, Amazon says, unless you run and maintain it, too.
Netflix takes an approach similar to Amazon’s, with the motto “You own it, you run it.” Development teams at Netflix are on-call for their services 24/7, and there’s a Core SRE team that monitors services at a very high level and engages development teams only when large-scale outages occur. According to Netflix SRE Manager Blake Scrivener, “When something goes wrong [at Netflix], which our automation doesn’t handle correctly, we want the experts in the service to be immediately available to make the repair and [bring] stability to the customer experience…when things are broken, we want people with the best context trying to fix things.” In an engineering environment where services are being deployed multiple times a day, the people with the best context are almost always the development teams.
Out of all of the companies we surveyed, only Slack still had anything resembling an old-school operations team. Slack’s operations team, which is on-call for all of Slack’s services, is spread across the globe and uses a follow-the-sun rotation, with operations engineers located in Melbourne, Dublin, and San Francisco. “The decision to put operators on-call as the first responders is as old as the company itself,” says Richard Crowley, Director of Operations at Slack, because “historically, the things that broke tended to ultimately have contributing factors like hardware failures or network partitions.” Crowley says that they’ve recently started to see scalability problems with the old way of operations, however, which led Slack to create a secondary on-call rotation full of developers; software and performance bugs, he says, are becoming much more common than low-level infrastructure problems—bugs that only the development teams know how to fix. Given the industry trend, we don’t think it’ll be long before Slack joins the rest of the industry and puts their development teams on-call for all of their services.
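For readers unfamiliar with the term, a follow-the-sun rotation hands the pager from office to office so that whoever is on-call is working roughly daytime hours. The sketch below is only an illustration of that idea, not Slack's actual schedule: the three offices come from the description above, but the UTC shift boundaries, the function names `primary_on_call` and `responders`, and the `"dev:secondary"` label are all assumptions made for the example.

```python
from datetime import datetime, timezone

# Hypothetical shift table: (start hour, end hour, office), in UTC.
# The boundaries are illustrative assumptions, not a published schedule.
SHIFTS = [
    (0, 8, "Melbourne"),
    (8, 16, "Dublin"),
    (16, 24, "San Francisco"),
]


def primary_on_call(now: datetime) -> str:
    """Return the office whose shift covers the timezone-aware time `now`."""
    hour = now.astimezone(timezone.utc).hour
    for start, end, office in SHIFTS:
        if start <= hour < end:
            return office
    raise ValueError(f"shift table does not cover hour {hour}")


def responders(now: datetime, needs_code_fix: bool) -> list[str]:
    """Build the page order: operations first, developers second if needed."""
    chain = [f"ops:{primary_on_call(now)}"]
    if needs_code_fix:
        # Assumed secondary rotation staffed by developers.
        chain.append("dev:secondary")
    return chain
```

Calling `responders` with a timezone-aware timestamp and `needs_code_fix=True` returns the operations office for that hour followed by the developer rotation, which mirrors the primary and secondary arrangement Crowley describes.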