“My cardinal piece of advice would be to foster an environment where incidents and outages are handled like unpleasant but unavoidable occurrences; the only way to never have outages is to not run anything in the first place! If everyone in the team feels comfortable declaring outages and facing them, with proper focus and application the rest will take care of itself.”
— Alexis Lê-Quôc, CTO at Datadog
“[If we had to start from scratch], we would probably centralize more of the incident response function, because a significant component of that comes from the experience you gain running incidents. With sufficiently large teams and stable systems, no individual experiences enough to become really good at it. Therefore, centralize function to improve the chances of getting good at it.”
— Niall Richard Murphy, Head of Ads Reliability Engineering at Google
“Don’t try to reinvent the wheel—there are a lot of articles, books, and other material online that you should be reading and learning from. Just like everything in engineering, take an iterative approach. Find a reasonable and simple process to try right now, see how it works for a few months and/or incidents, have a retrospective, and adapt based on what you have experienced and learnt.”
— Phil Calçado, Director of Product Engineering at DigitalOcean
“Keep [your on-call practices] simple but rigorous. Executive buy-in is critical. This is not something that should be bottoms-up, [but] should be driven from the highest levels of engineering. Reliability is easy to lose, and once you’ve lost it, it’s easily a massive project to climb out of the hole.”
— Andrew Fong, Director of Engineering at Dropbox
“Smaller companies [have fewer] on-call engineers and more tribal knowledge. The keys to success include not having alert fatigue and making sure that knowledge is captured and shared. Make sure that everything that alerts is actually human-actionable and be aggressive about suppressing alerts and fixing those root causes. Additionally, make sure there is solid coverage—do not rely on the ‘hero’ approach of having one person handle incidents.”
— Sweta Ackerman, Engineering Manager at PagerDuty
“Don’t over-engineer the process when you are small—cover the basics. Bias towards recovery versus investigation during incidents. [Use postmortems to] follow up on remediations, and perform review for major incidents. Alert on a few high-quality business metrics to start—make sure users are being served!”
— Jeremy Carroll, Site Reliability Engineer at Pinterest