Reliability at scale

Leaders at Deliveroo, DigitalOcean, Fastly, and Headspace share how their organizations think about reliability and resiliency, and offer advice to engineering orgs embarking on their own reliability journeys.
Part of Issue 16: Reliability, February 2021

Victoria Puscas

Engineering manager, pricing and logistics algorithms
Deliveroo

2,500 employees

Laura Thomson

VP of engineering
Fastly

750+ employees

Al Sene

VP of engineering
DigitalOcean

500+ employees

Bhavini Soneji

VP of engineering
Headspace

250+ employees


How does your organization think about resiliency and reliability writ large?

Our engineering principles reflect our mission to prioritize high-quality code and emphasize our commitment to learning from mistakes. We conduct production incident reviews in a safe, nonjudgmental environment to understand both the root cause of a problem and the necessary steps to prepare the system and organization to successfully cope with similar problems in the future.

That’s also why we instrument our systems for observability. You don’t know you have a problem if you don’t monitor the health and performance of your systems—and have actionable, well-documented steps to mitigate issues. No service or system goes to production unless it has some basic monitoring in place.

Finally, we build to last. When building products and solutions, we ensure services will perform and scale even as the business doubles in size annually. We run experiments and make data-driven decisions, which also means building services and products that can tolerate frequent change and are easy to extend or build on.

— Victoria Puscas, Deliveroo
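Puscas's baseline of "no service or system goes to production unless it has some basic monitoring in place" is concrete enough to sketch. Below is a minimal example of what that might look like for a small Python service instrumented with the prometheus_client library; the metric names, labels, and port are illustrative assumptions, not Deliveroo's.

```python
# Minimal service instrumentation sketch: a request counter, a latency
# histogram, and a metrics endpoint a monitoring system can scrape.
# Metric names, labels, and the port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("orders_requests_total", "Requests handled", ["status"])
LATENCY = Histogram("orders_request_seconds", "Request latency in seconds")

def handle_order(order_id: int) -> None:
    """Pretend to price an order, recording latency and outcome."""
    start = time.perf_counter()
    try:
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work
        REQUESTS.labels(status="ok").inc()
    except Exception:
        REQUESTS.labels(status="error").inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for the scraper
    for i in range(100):
        handle_order(i)
```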

We fall back on the resilience of our systems and people when reliability fails, so both are critical to our success. While some elements of a resilient team mirror a resilient network—avoiding single points of failure, for example—we know a resilient team’s foundation requires uniquely human traits like empathy, vulnerability, and understanding.

— Laura Thomson, Fastly

We encourage a culture of learning to continually get better at preventing incidents, reducing impact, and shortening recovery times. Incidents are an opportunity to grow and improve resiliency—we run blameless postmortems that focus on lessons learned and preventative measures.

Scaling reliability requires a focus on technology, people, and process. In terms of technology, we architect and design in a way that’s conducive to more resilience and fault tolerance when things go wrong. Time to recover after failure detection is crucial, so the incident management and recovery process must be easy to deploy. Finally, it’s important to ensure people are well-prepared and invested in making the overall system better.

— Al Sene, DigitalOcean

Our roadmap has two categories: innovation (product features) and tune-up items, which drive reliability, scale, and speed of innovation. Tune-up items form the backbone of our innovation and include investing in automated quality gates so we can release frequently, fast incident detection and response, and refactoring to ensure components are secure, scalable, and available, with low latency and continuous delivery.

This amounts to a self-perpetuating flywheel. Investing in tune-up items—and therefore reliability and speed of innovation—leads to higher development velocity, which improves product innovation, which leads to scaling the team, which then leads to more investment in the speed of innovation.

— Bhavini Soneji, Headspace

Does your organization have dedicated reliability engineers?

We don’t have DevOps or support engineers, per se. We have engineers who build solutions for our platforms and infrastructure and are responsible for building systems that enable our product teams to design, build, and ship changes quickly. These solutions include CI/CD, our event bus, internal libraries, federated authentication services, and more.

— Victoria Puscas, Deliveroo

We have reliability engineers and resilience engineers. Reliability engineers own our platform’s integrity, which represents customers’ trust in our services in every respect, including stability, security, and data integrity. Resilience engineers look for brittle points across all our systems and rebuild accordingly.

— Laura Thomson, Fastly

We currently have a small, specialized group of dedicated reliability engineers, with the intention to grow into a larger SRE team. They work with product teams to enhance the customer experience and ensure our services meet their target availability.

— Al Sene, DigitalOcean

We don’t have dedicated reliability engineers. Our approach is to give engineers and development teams end-to-end ownership of their scenario or component. We have an on-call rotation for each platform (iOS, Android, API, and web), with weekly rotations and a warm hand-off. The on-call engineer is responsible for initial triage, for routing issues to the team that owns the scenario or component when there’s no standard playbook, for both production and non-production environments, and for managing mobile client releases.

— Bhavini Soneji, Headspace

What measures or metrics do you use to capture investment in reliability?

At a team level, we look at a prioritized list of repair items and reliability-related tickets every sprint. We also track our teams’ service uptime, SEVs by level, time in SEV, and time to restore. At a higher level, when we plan for the quarter or longer time horizons, we create engineering goals around making our platform more stable and scalable. These become part of our roadmap.

There are also times when we need to take a stance on some of our systems’ older parts. For example, we might have a company-level initiative to improve the tools we give restaurants to manage their commercials or menus. For such initiatives, we typically prioritize quality over profitability as the guiding metric. These changes are intended to significantly improve the experiences of our restaurants, riders, or internal partners by giving them useful, reliable, mature tools and systems to grow and work with.

— Victoria Puscas, Deliveroo

We measure common metrics like time to detect and time to recover from issues. We’re also always looking for ways to reframe base reliability metrics around customer impact. It’s important not to default to only measuring easy things, but to really dig in and ask teams, “What are your goals? What do you need to do to get there?” Metric choice is critical to driving the right system optimizations.

— Laura Thomson, Fastly

We strive to monitor and measure everything we can to encapsulate all aspects of the process, find errors, determine the health of our systems, and identify opportunities for future growth. We enact our commitment to reliability by making it everyone’s responsibility.

Typically, we measure overall system availability, subsystems health, and customer experience against internal objectives. We also take into account industry-standard metrics like time to detect, time to recover, change fail rates, and outage durations. As we continue to make investments, we monitor how these metrics trend over time.

— Al Sene, DigitalOcean

We evaluate reliability investment by asking:

  • What’s the customer impact? (e.g., bugs, latency, time to value)

  • What’s the business impact? (e.g., brand trust, revenue impact due to downtime)

  • What’s the impact on the operational efficiency of the business? (e.g., developer and staff productivity)

We use different sets of metrics. The first is a high-level reliability view, which includes count of incidents and customer impact, revenue implications, crash rate, and app store review rating. The second is around production incidents and follow-through, which includes mean time to resolution, fix rate for production bug fixes and incident action items, percentage of incidents reported through customer support versus detected through alerting, and more. The third covers proactive strengthening of the software development life cycle pipeline, including releases canceled or rolled back, quality gates, end-to-end test automation count and coverage, pre-production environment uptime, and more.

— Bhavini Soneji, Headspace
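To make the second set of metrics concrete, here is a small sketch of how mean time to resolution and the alerting-versus-support detection split could be computed from raw incident records. The Incident fields and the sample data are hypothetical, not Headspace's schema.

```python
# Sketch: computing two of the incident metrics described above from raw
# incident records. The Incident fields and sample data are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class Incident:
    opened: datetime
    resolved: datetime
    detected_by: str  # "alerting" or "customer_support"

def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    return timedelta(
        seconds=mean((i.resolved - i.opened).total_seconds() for i in incidents)
    )

def alerting_detection_rate(incidents: list[Incident]) -> float:
    detected = sum(1 for i in incidents if i.detected_by == "alerting")
    return detected / len(incidents)

incidents = [
    Incident(datetime(2021, 1, 4, 9, 0), datetime(2021, 1, 4, 10, 30), "alerting"),
    Incident(datetime(2021, 1, 11, 14, 0), datetime(2021, 1, 11, 14, 45), "customer_support"),
    Incident(datetime(2021, 1, 20, 22, 0), datetime(2021, 1, 21, 1, 0), "alerting"),
]
print(mean_time_to_resolution(incidents))   # 1:45:00
print(alerting_detection_rate(incidents))   # ~0.67 -> 2 of 3 detected by alerting
```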

When it’s been a while since your last incident, how do you keep your teams sharp and ensure continued investment in reliability?

We celebrate our “longest since last SEV” moments, and if there are no incidents to discuss during our weekly live service health review sessions, we share best practices and celebrate good work.

We also proactively prepare for Q4, our busiest time of year. In September, our growth and daily order volume typically exceed expectations, which might come with unexpected production problems. During these periods, we prepare our systems to cope with roughly 10 percent more load every week. This involves looking at the system’s weakest points and proactively mitigating any risks.

— Victoria Puscas, Deliveroo
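Ten percent more load every week compounds quickly, which is what makes the proactive Q4 preparation necessary. A back-of-the-envelope sketch of what that growth rate implies over a quarter, using a made-up baseline of 1,000 orders per minute:

```python
# Back-of-the-envelope: 10 percent weekly growth compounds to roughly 3.5x
# over a 13-week quarter. The 1,000 orders/minute baseline is made up.
baseline_opm = 1_000
weekly_growth = 1.10

for week in range(0, 14, 2):
    projected = baseline_opm * weekly_growth ** week
    print(f"week {week:2d}: ~{projected:,.0f} orders/minute")

# week  0: ~1,000 orders/minute
# week 12: ~3,138 orders/minute -> plan capacity headroom well past steady state
```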

Along with regular onboarding and training, we run tabletop exercises, or “pre-mortems,” which help employees hone their skills and get ahead of potential problems. These exercises take an operational team through a plausible scenario—usually complex, worst-case–style scenarios—and the team works through how they’d respond, what they’d investigate, and potential fixes. Then, if the worst really happens, people might have already thought through what to do and will be calmer and better prepared to respond. These exercises are also fun and great for team building.

— Laura Thomson, Fastly

We keep our team sharp by improving existing processes and conducting readiness reviews to anticipate what could go wrong with new code and how to avoid it in the first place. We want to ensure the team is proactive, not just reactive, and we know there’s never a shortage of areas for improvement. We focus on better detection, prevention, and testability of our recovery procedures. It’s crucial to have a dependable testing method that helps us understand incidents as they occur so they can be mitigated in the future.

— Al Sene, DigitalOcean

Reliability is a culture, and it has to be embraced by engineers, product managers, and designers. We want to instill a culture of data-driven decision-making, and we want teams to proactively inspect the health of their releases and ensure they can analyze and detect issues.

We also prioritize transparency around incidents and learnings: biweekly live site meetings with leads, fixed calendar slots for postmortems whose learnings are broadcast to the org, and quarterly chaos simulations in pre-production, which we aim to automate.

— Bhavini Soneji, Headspace
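The quarterly chaos simulations Soneji mentions don't have to start with heavyweight tooling. Here is a minimal fault-injection sketch: a wrapper that randomly adds latency or raises a failure around a downstream call so that retries, timeouts, and alerting can be exercised in pre-production. The decorator, the rates, and the wrapped function are illustrative assumptions, not Headspace's setup.

```python
# Minimal chaos-injection sketch: wrap a downstream call and, with some
# probability, inject extra latency or an outright failure so the caller's
# retries, timeouts, and alerts can be exercised in pre-production.
# The rates and the wrapped function are illustrative assumptions.
import functools
import random
import time

def chaos(latency_rate=0.1, failure_rate=0.05, added_latency_s=2.0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < failure_rate:
                raise ConnectionError("chaos: injected dependency failure")
            if roll < failure_rate + latency_rate:
                time.sleep(added_latency_s)  # injected slow dependency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@chaos(latency_rate=0.2, failure_rate=0.1)
def fetch_profile(user_id: str) -> dict:
    return {"user_id": user_id, "plan": "premium"}  # stand-in downstream call

if __name__ == "__main__":
    for i in range(10):
        try:
            fetch_profile(f"user-{i}")
        except ConnectionError as err:
            print(err)  # the experiment verifies callers handle this gracefully
```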

How do you think about and/or fund projects to address low-probability but high-risk events?

Some events can be mitigated with a clear procedure and line of escalation. We have a few documents that describe what we might need to do and who we might need to involve in case of a data breach or security vulnerability. (Fortunately, I can’t remember anything like that happening in the four years I’ve been here.)

There’s also a risk that a third-party provider goes down or a data center experiences an outage. Then the question becomes whether to invest the time to mitigate these issues now or later. Over the next few years, we plan to build infrastructure that’s resilient to local failures by isolating issues to a specific region rather than letting them affect customers elsewhere.

— Victoria Puscas, Deliveroo

These events are best prevented at the architectural level, whether in the design phase or via the systems thinking and hacking projects our resilience engineering team takes on. We also work through simulations to explore prevention and mitigation strategies, including the tabletop exercises mentioned earlier, as well as network simulations using in-house tools.

— Laura Thomson, Fastly

Just because an event is low-probability doesn’t mean you shouldn’t prepare for it. You should invest in contingency plans to stay ahead of any incident that might pop up. This ensures you’re taking care of customers regardless of what’s happening behind the scenes.

We continue to invest heavily in building reliable, secure, and highly available services across our portfolio of products. This is table stakes for the cloud industry.

— Al Sene, DigitalOcean

For cloud region failures and data breaches, the first order of business is architecting and designing the system correctly: ensuring we have data encryption and data backups, and having stateless services driven through configuration. We ensure our roadmap and strategic planning prioritize strengthening our quality gates to mitigate incidents proactively, with fast detection and response, chaos simulation, and cloud region and data security. Strengthening cross-functional teams, processes, and foundations isn’t necessarily exciting, but it’s key to innovation.

— Bhavini Soneji, Headspace

What would you share with rapidly growing tech companies to help them on their own reliability journeys?

Identify the most critical areas of your systems and proactively look at what needs to be done to support growth by at least one order of magnitude. I also recommend avoiding making huge changes in one go—anything bigger than a few weeks of work is probably too big, unless you have a specific reliability or scalability problem and know exactly what you’re doing. Why? Because this particular work, improvement, tech stack, etc. might not be the solution, and it may be difficult to convince the company to let a team take a six-month journey without a guarantee of success or improvement.

Things will still go sideways. That’s normal. Try to iterate and experiment quickly. Find a problem, prioritize it accordingly, and test it. There is no other secret sauce!

— Victoria Puscas, Deliveroo

As you scale, you have to reinvent what you’re doing. You can’t just scale it up linearly—you have to get creative. Think divergently, and don’t be afraid to try something completely different. At Fastly, we’re scaling our performance test platform so we can squeeze every drop of performance from our systems under production-like traffic. Figuring out how to simulate realistic load has been an interesting challenge.

— Laura Thomson, Fastly

My recommendation to other high-growth tech companies is to build a culture of learning within your organizations. Be transparent with your teams about incidents and the lessons learned, and continue to iterate. Honesty and humility about your shortcomings is the first step to ensuring future success.

— Al Sene, DigitalOcean

You have to approach reliability with the same mindset as product development. Many factors will shape your particular investment, but the key is making it part of the company’s DNA to find alignment between product and engineering teams during company strategy planning.

As feature teams broaden to include different product lines, staffing horizontal teams becomes critical to laying the technology foundation, driving reusability, and maintaining consistency. Platform teams lay out the common building blocks that application teams can build on or reuse, while infrastructure teams lay out the framework that application teams will integrate to drive continuous releases with quality gates and enable fast detection and response.

The bottom line: Have transparency and clear communication around decisions and trade-offs, while having the flexibility to align with business priorities and meet hard deadlines.

— Bhavini Soneji, Headspace
