Heidi Waterhouse
Everything is broken, and it’s okay
Accepting that imperfect things still work is fundamental to preventing failures from becoming catastrophes.
This issue shares approaches to reliability and resiliency in our software, technologies, and teams, and offers perspectives on the realities of failure in the systems we build.
By encoding resilience into an organization’s culture, engineering teams can be better equipped to tackle the unknown and unexpected.
The way we fight fires affects how quickly we can resolve outages. Appointing an incident commander can help—and you (yes, you) can be one.
Pseudo-tested methods can be a reliability risk. Here, the authors explain how they developed a methodology and tool to uncover them in Java applications.
To build a high-performing software delivery system, your stack’s capabilities are just one part of the picture.
A chronicle of Glitch’s efforts to gain visibility into its production systems—and make them more reliable.
As software systems become ever more complex, chaos engineering provides a (not-actually-so-chaotic) tool kit for building more reliable and resilient systems.
Documentation, automation, and a little sharing-is-caring can help OSS projects maintain their uptime.
A case for designing consumer software with safety-critical principles and formal methods in mind.
A discussion of the distinctions (and dependencies) between reliability and resilience, and how to build complex systems that perform under strain and surprise.
How Yelp engineers orchestrated their traffic failover process and effected a delicate balance between reliability, performance, and cost efficiency.
Lessons for tech orgs looking to adopt a resilience engineering perspective.
Strategies for nurturing that feel-good sense of accomplishment when doing largely invisible work.
If a major solar storm were to sweep across Earth, would today’s electrical and communications infrastructure be resilient enough to endure its impact?
Facing dramatic shifts in residential usage, internet service providers are working to keep latency low and connectivity high.
Leaders at Deliveroo, DigitalOcean, Fastly, and Headspace share how their organizations think about reliability and resiliency and their advice to engineering orgs embarking on reliability journeys.
The company’s disaster preparedness plan, developed in the aftermath of a devastating cyclone, enabled it to adapt and endure during a global pandemic.
Seeing a year’s worth of capacity growth in a matter of weeks, the CDN services provider hustled to build and reinforce the infrastructure it needed to serve its users (and European soccer fans).