Ask an expert: How should startups approach on-call and incident response?

“My cardinal piece of advice would be to foster an environment where incidents and outages are handled like unpleasant but unavoidable occurrences; the only way to never have outages is to not run anything in the first place! If everyone in the team feels comfortable declaring outages and facing them, with proper focus and application the rest will take care of itself.”

— Alexis Lê-Quôc, CTO at Datadog

“[If we had to start from scratch], we would probably centralize more of the incident response function, because a significant component of that comes from the experience you gain running incidents. With sufficiently large teams and stable systems, no individual experiences enough to become really good at it. Therefore, centralize function to improve the chances of getting good at it.”

— Niall Richard Murphy, Head of Ads Reliability Engineering at Google

“Don’t try to reinvent the wheel—there are a lot of articles, books, and other material online that you should be reading and learning from. Just like everything in engineering, take an iterative approach. Find a reasonable and simple process to try right now, see how it works for a few months and/or incidents, have a retrospective, and adapt based on what you have experienced and learnt.”

— Phil Calçado, Director of Product Engineering at DigitalOcean

“Keep [your on-call practices] simple but rigorous. Executive buy-in is critical. This is not something that should be bottoms-up, [but] should be driven from the highest levels of engineering. Reliability is easy to lose, and once you’ve lost it, it’s easily a massive project to climb out of the hole.”

— Andrew Fong, Director of Engineering at Dropbox

“Smaller companies [have fewer] on-call engineers and more tribal knowledge. The keys to success include not having alert fatigue and making sure that knowledge is captured and shared. Make sure that everything that alerts is actually human-actionable and be aggressive about suppressing alerts and fixing those root causes. Additionally, make sure there is solid coverage—do not rely on the “hero” approach of having one person handle incidents.”

— Sweta Ackerman, Engineering Manager at PagerDuty

“Don’t over-engineer the process when you are small—cover the basics. Bias towards recovery versus investigation during incidents. [Use postmortems to] follow up on remediations, and perform review for major incidents. Alert on a few high-quality business metrics to start—make sure users are being served!”

— Jeremy Carroll, Site Reliability Engineer at Pinterest
