Quis custodiet ipsos custodes?
Monitoring agents are our modern internet fire alarms. While we sleep, they keep an eye on our most critical systems. And when something goes wrong, they find us and alert us, rousing us from our slumber. Monitoring is used to test that our systems in production meet our expectations in real time. If we expect our cloud infrastructure instance will only consume 75 percent CPU, we can use a monitoring agent to alert us when we hit 65 percent.
But who monitors the monitoring? Chaos engineering injects failure into these systems on purpose, in a controlled way, to test our monitoring and alerting. Think of it as unit testing your monitoring and alerting. Or, put another way: What happens when the fire alarms don’t work?
Many a tale has been told about how monitoring (or a lack thereof) can cause outages. These tales are passed on by word of mouth, warnings to future engineers who hope not to get caught out in the same way. As the Chaos Engineering Crypt Keeper, I will open the unmonitored crypt to do the same. Listen closely, and avoid the terrible fates that have befallen your forebears.
The thresholds that were too wide
On a dark and stormy night, a page went off, waking the engineer from his sleep at 3 a.m. He rolled out of bed, slid past his puppy, and opened his laptop in the January darkness. Before he’d roused himself enough to take action, the page auto-resolved and all was, seemingly, okay. “The thresholds were too sensitive,” he thought, and changed the alert to only fire after 15 minutes instead of five. He crawled back into bed, pushing the alert threshold change to the back of his mind. At work the next day he didn’t even think to mention the changes.
Achieving a service-level agreement (SLA) of 99.999 percent allows a maximum of five minutes and 15.6 seconds of downtime a year.
One week later, disaster struck: A catastrophic high-severity incident (SEV 0) occurred because the threshold was now too wide. The on-call engineer slept through a 15-minute outage. The company was . . . unable to hit their five 9s of reliability. 💀💀💀💀💀
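The arithmetic behind this tale is easy to check. Here is a minimal sketch, assuming an average year of 365.25 days; the function names are my own, not part of any monitoring tool. It computes the yearly downtime budget for a given availability target and asks whether an alert delay fits inside it.

```python
# Assumption: an average year of 365.25 days (leap days included).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_budget_seconds(availability_percent: float) -> float:
    """Return the downtime allowed per year at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def alert_delay_fits_budget(delay_minutes: float, availability_percent: float) -> bool:
    """Return True if an alert that waits this long fires before the
    whole yearly budget has been spent by a single outage."""
    return delay_minutes * 60 < downtime_budget_seconds(availability_percent)


if __name__ == "__main__":
    budget = downtime_budget_seconds(99.999)
    print(f"Five 9s budget: {budget:.1f} seconds per year")
    print("5-minute alert delay fits:", alert_delay_fits_budget(5, 99.999))
    print("15-minute alert delay fits:", alert_delay_fits_budget(15, 99.999))
```

With five 9s the budget is roughly 315.6 seconds, so an alert that waits 15 minutes cannot fire until the whole year's allowance is already gone.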
The engineer who would only ack
Once upon a time there was an engineer who would only ack. He was on call through the night, but his pillows were more enticing than his work. When his pager went off, he would ack, then tumble back into dreamland without investigating any issues. The rest of the team slept soundly, not knowing their comfort was a mere illusion. The morning after the ack-ing, the manager wondered how so many alerts could be ack’d so quickly while continuing to fire—it made no sense. She questioned the engineer, who confessed he only ever ack’d and never opened his computer to investigate, mitigate, debug, or resolve the issues.
This is dangerous and results in a loss of trust. You can imagine what happened next.
The service that didn’t get agents
A product team needed a new feature to be built for several large customers—yesterday. They scoped what they wanted and begged the product engineering team to move it to the top of their list. The product engineering team complied, working hard to build and launch it exactly to spec and as quickly as possible. But something wasn’t right after launch—they were getting complaints every day. “I can’t actually create anything?” “It deleted all my data.” “It’s just hanging there not doing anything!” The SRE team parachuted in to assist. Within five minutes they realized there was no monitoring, no alerting—no way to see how the service was operating. The SREs helped the product engineering and product teams ensure they had effective monitoring, alerting, capacity planning, incident management, and release as well as test management measures in place for the service and for future product launches. It took several weeks to debug and resolve the issues.
A grimacing conclusion to what might have been a triumphant product launch. 😬
The agents that failed to report
It was 7 p.m. on a cloudy Wednesday night. The dashboards were empty, and no metrics had been reported for five minutes. There was a gap in the timeline.
An on-call engineer became concerned: “No metrics at all for five minutes? Do we not have any monitoring right now? If we don’t have any monitoring, then we don’t have any visibility into how the services are operating!”
Wiping sweat from her brow, the on-call engineer reported the outage as a SEV 0. Fortunately, the team had been trained to handle this exact issue during a recent GameDay focused on monitoring and alerting. The incident reporting tool automatically paged the internal monitoring team, and the monitoring on-call engineer hopped into the automatically generated Slack channel for the incident. They confirmed that agents were failing to report and began to resolve the incident. Unfortunately, no alert had paged the monitoring team when the agent status checks themselves started failing, but they later put a fix in place to ensure this same incident wouldn’t go unreported again.
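A check for silent agents can be sketched in a few lines. This is an illustration only: the `last_report` mapping, the host names, and the five-minute default are assumptions of mine, standing in for whatever your monitoring backend exposes.

```python
from datetime import datetime, timedelta


def silent_agents(
    last_report: dict[str, datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),
) -> list[str]:
    """Return the hosts whose agent has not reported within max_silence."""
    return sorted(
        host for host, seen in last_report.items() if now - seen >= max_silence
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 3, 19, 0)
    reports = {
        "web-1": now - timedelta(minutes=1),  # hypothetical host, still reporting
        "db-1": now - timedelta(minutes=6),   # hypothetical host, gone quiet
    }
    # Any host returned here is a candidate for paging the monitoring team.
    print(silent_agents(reports, now))
```

Paging on the output of a check like this is one way to make sure the gap in the timeline is noticed by a machine before it is noticed by a worried human.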
The monitoring software whose metrics were delayed
An engineering team was using third-party monitoring software to monitor all services across their company. But one August morning, the SRE team received a page that the third-party software was experiencing a delay in processing metrics:
We’re actively investigating increased metric intake latencies. As an effect of latency, metrics on graphs may be delayed. These delays may result in ‘no data’ alert conditions for Metric Monitors. To avoid spurious alerts we’ve temporarily disabled these alert types. It’s important to note metrics are delayed but will be backfilled once all services are operational.
This incident took an hour to resolve!
The retries that just kept happening
Engineers looking at iptables saw the number of packets per second sent to the VPC resolver indicate a significant increase in traffic during specific time periods. Taking a closer look at the shape of the traffic coming into the DNS servers from the Hadoop job, they noticed the clients were sending the request five times for every failed reverse lookup. Since the reverse lookups were taking so long or being dropped at the server, the local caching resolver on each host was timing out and continually retrying the requests. On top of this, the DNS servers were also retrying requests. As a result, request volume increased by a factor of seven.
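One way to watch for this kind of retry storm is to compare the number of requests actually sent with the number of distinct queries behind them. The sketch below is an assumption of mine, not the method the engineers in this tale used; the function names and the alerting ratio of 2.0 are hypothetical.

```python
def retry_amplification(requests_sent: int, distinct_queries: int) -> float:
    """Return how many requests hit the wire per distinct query."""
    if distinct_queries <= 0:
        raise ValueError("distinct_queries must be positive")
    return requests_sent / distinct_queries


def is_retry_storm(
    requests_sent: int, distinct_queries: int, max_ratio: float = 2.0
) -> bool:
    """Return True when retries multiply traffic beyond max_ratio (assumed threshold)."""
    return retry_amplification(requests_sent, distinct_queries) > max_ratio


if __name__ == "__main__":
    # 700 requests for 100 distinct lookups is the sevenfold increase from the tale.
    print(retry_amplification(700, 100))
    print(is_retry_storm(700, 100))
```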
An investigation into the depths of the retries uncovered areas where the monitoring could be improved. If the engineers hadn’t taken the time, this tale from the crypt would be much spookier. 🙀
But what if a tale comes back to haunt us? Well, then it’s time to call in the chaos engineers.
All of these stories have something in common: Nothing was being done to monitor the monitoring. Problems existed undetected and unresolved, hidden in the darkness. These issues lurked in the shadows until—BOOM. Disaster struck. So how do we prevent it?
The metrics of monitoring
First, establish the critical metrics you need to be monitoring. These include business-level objectives (BLOs), service-level objectives (SLOs), service-level indicators (SLIs) / service key performance indicators (KPIs), and service-level agreements (SLAs). If your monitoring is not working accurately, then you won’t be able to track your core business and system metrics.
BUSINESS-LEVEL OBJECTIVES (BLOS)
Asking your business team what your BLOs are will enable you to focus your chaos engineering efforts on core services. For example, if you’re working on an e-commerce business, your BLOs will likely be related to payments, the shopping cart, new user registration, and auth.
SERVICE-LEVEL AGREEMENTS (SLAS)
An SLA is a contract between your company and a customer that specifies that your product’s availability will meet a certain level over a certain period. If it fails to do so, the company must pay some kind of penalty, often in the form of cash or credits. Frequently, an SLA is measured by the average number of 9s during a calendar year.
SERVICE-LEVEL OBJECTIVES (SLOS)
Establishing and tracking your SLOs will give your entire engineering organization a bird’s-eye view of your most critical services.
SERVICE-LEVEL INDICATORS (SLIS) / SERVICE KEY PERFORMANCE INDICATORS (KPIS)
Monitoring software is responsible for collecting and reporting these specific metrics. It’s useful to roll them up each day and send them out via email to a metrics list anyone within the company can subscribe to. Everyone on the mailing list can verify that both the service and metrics reporting are meeting expectations.
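A daily rollup can be as simple as one line per service. The following is a minimal sketch under my own assumptions: the `DailyRollup` fields, the service names, and the SLO values are hypothetical, and availability is measured as good requests over total requests. Note that a day with no data is reported as its own state, because it usually means the metrics pipeline, not the service, needs attention.

```python
from dataclasses import dataclass


@dataclass
class DailyRollup:
    service: str
    good_requests: int
    total_requests: int
    slo_percent: float


def summarize(rollup: DailyRollup) -> str:
    """Return a one-line status suitable for a daily metrics email."""
    if rollup.total_requests == 0:
        # No data at all: the monitoring itself may be broken.
        return f"{rollup.service}: NO DATA (check metrics reporting)"
    sli = 100 * rollup.good_requests / rollup.total_requests
    status = "OK" if sli >= rollup.slo_percent else "BELOW SLO"
    return f"{rollup.service}: SLI {sli:.3f}% vs SLO {rollup.slo_percent}% -> {status}"


if __name__ == "__main__":
    for rollup in (
        DailyRollup("checkout", 9990, 10000, 99.95),
        DailyRollup("cart", 10000, 10000, 99.9),
        DailyRollup("auth", 0, 0, 99.9),
    ):
        print(summarize(rollup))
```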
The methods of monitoring
GameDays and continuous monitoring verification are solutions to the problem of not having visibility into your monitoring infrastructure. GameDays allow you to identify the gaps you need to address, and continuous monitoring verification will help you ensure that once you’ve identified and resolved them, you won’t get burned again.
I recommend these two solutions because they’re efficient and effective ways to keep yourself from becoming a cautionary tale from the crypt. For example, you could run a GameDay to solve “The service that didn’t get agents”: You can purposely inject failure into the system by creating a service with no monitoring agent and use this anomaly service to test that if a service doesn’t get monitoring agents when it’s supposed to, it’s either resolved automatically or you’re alerted to the issue.
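The detection half of that GameDay can be sketched as a comparison between what is deployed and what is reporting. The service names below, including the deliberately unmonitored `gameday-canary`, are hypothetical; in practice both sets would come from your service inventory and your monitoring backend.

```python
def find_unmonitored(deployed: set[str], reporting: set[str]) -> list[str]:
    """Return deployed services that have no monitoring agent reporting."""
    return sorted(deployed - reporting)


if __name__ == "__main__":
    # The canary is deployed on purpose without an agent.
    deployed = {"payments", "cart", "gameday-canary"}
    reporting = {"payments", "cart"}
    missing = find_unmonitored(deployed, reporting)
    # The experiment passes only if the anomaly service is detected.
    print("canary detected:", "gameday-canary" in missing)
```

If the canary never shows up in the result, you have found the gap before a customer did.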
GameDays were created by Jesse Robbins when he worked at Amazon as the “Master of Disaster” and was responsible for availability. These two- to four-hour team undertakings aim to increase reliability by purposefully creating major failures on a regular basis. (They also help facilitate chaos engineering because they involve running a sequence of chaos engineering experiments.) Typically, GameDays are held on a monthly cadence and involve a team of engineers who either develop an application or support it (ideally you’ll have both in attendance).
GameDays are a great way to formalize your chaos engineering experiments in a thoughtful and controlled manner.
Here’s what you need to get started:
- Set goals: GameDays need goals in order to ensure that you’re creating relevant test cases. Sometimes the goal is to replay as many previous production impacts as possible to test whether or not the current systems are more or less resilient. Other times it’s to ensure a new system has all the right monitoring, alerts, and metrics in place before it’s deployed to production. Your goals will determine who you need to invite to the GameDay.
- Send out invitations: After you’ve determined your goals, you’re ready to determine who you need in attendance. Send invitations to everyone on your list, from engineering VPs to staff and principal engineers. Send your attendees a very simple placeholder invite. And don’t give too much away.
Chaos Day is coming!
Keep this day free, we need you.
- Whiteboard the system architecture: On GameDay, all the right people are dialed in or otherwise present. Now it’s time to whiteboard a system’s architecture. This session helps clearly illustrate (somewhat literally) what you’re about to break, and makes obvious the areas worth testing. It also creates contextual consistency by bringing everyone up to date on the latest build of a system.
- Design test cases: Develop test cases to answer the questions “What could go wrong?” and “Do we know what will happen if this breaks?” As the team looks at the architecture on the whiteboard, you’ll start to identify areas of concern. “What if we lose monitoring agents on instances?” “What if monitoring agents fail to deploy?” “What’s the mean time for monitoring agents to be redeployed and start reporting again?” These are your test cases.
Scope is important for GameDay test cases. Often there are two angles to a blast radius:
- Impact at the host level: How bad is this failure?
- Number of hosts: How widespread is this failure?
Very different types of failures can be plotted along these axes. For example, you might start with one service that doesn’t have monitoring agents. You’ll then increase the blast radius and test to see what happens when there are five services with no monitoring agents. Gradually increasing the blast radius keeps your chaos engineering experiments safe and controlled. (Do you see a pattern here?) Rather than leaping head-on into a full-blown region failover, a gradual increase keeps you safe and enables you to learn in a much more detailed way.
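Growing the blast radius in stages can be sketched as a simple loop that stops at the first stage that fails. Everything here is illustrative: the function name, the stage sizes, and the `experiment` callback (which returns True when monitoring behaved as expected) are my assumptions.

```python
from typing import Callable, Iterable


def run_with_growing_blast_radius(
    targets: list[str],
    stages: Iterable[int],
    experiment: Callable[[list[str]], bool],
) -> int:
    """Run the experiment on progressively more targets.

    Stops at the first stage that fails or that asks for more targets
    than exist. Returns the largest blast radius that passed (0 if none).
    """
    largest_passed = 0
    for size in stages:
        if size > len(targets):
            break
        if not experiment(targets[:size]):
            break
        largest_passed = size
    return largest_passed


if __name__ == "__main__":
    services = [f"svc-{i}" for i in range(10)]  # hypothetical service names
    # Stand-in experiment: pretend alerting copes with up to five services.
    passed = run_with_growing_blast_radius(
        services, [1, 5, 10], lambda batch: len(batch) <= 5
    )
    print("largest safe blast radius:", passed)
```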
Yet another parameter is time: How long do you want to run an experiment? Not all applications show a fault within a minute. Sometimes systems, especially hosts with plenty of resources, take time to get backed up or slowed down. Determine a reasonable duration to run the test that will show a change in the system.
Last, but never least, have an abort plan. This is especially important if you’re testing in production. The last thing you want a chaos experiment to do is stick around long past its welcome. That would certainly land your experiment in the crypt.
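Duration and the abort plan fit naturally into one small runner. This is a sketch under stated assumptions: `inject`, `rollback`, and `should_abort` are hypothetical callbacks you would supply, and the clock and sleep functions are parameters only so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    should_abort: Callable[[], bool],
    duration_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Inject a failure for at most duration_s, aborting early if asked.

    The rollback always runs, so the experiment cannot outstay its welcome.
    """
    try:
        inject()
        deadline = clock() + duration_s
        while clock() < deadline:
            if should_abort():
                return "aborted"
            sleep(poll_s)
        return "completed"
    finally:
        rollback()


if __name__ == "__main__":
    result = run_experiment(
        inject=lambda: print("injecting failure"),
        rollback=lambda: print("rolling back"),
        should_abort=lambda: False,
        duration_s=2,
        poll_s=0.5,
    )
    print(result)
```

Putting the rollback in a `finally` block is the design choice that matters: whether the run completes, aborts, or raises, the failure is removed.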
- Run, watch, and document: Once you’ve determined a set of test cases, it’s time to execute. Watch the main dashboard you’re using to monitor this exercise. As you’re conducting tests, ask yourself:
- Do we have enough information?
- Is the behavior what we expected?
- What would the customer see if this were to happen?
- What’s happening to systems upstream or downstream?
Collect and document answers for each test, then move on to the next one.
- Learn: What happened? Was that expected? What do you do next?
After you’ve run the tests, go over your notes as a group. During this recap, review the tests that showed interesting results first. Discuss in depth what happened, why those things happened, and how you plan to address any follow-up items. Treat these test results as if an incident has actually occurred and you’re in the midst of an incident review. Fill out the Jira tickets, implement monitoring changes, or add an item to the runbook—all valuable processes to follow in reviewing any incident.
Touch on the tests that have resulted in no impact as well. If a particular test has had an impact before, applaud the fix that mitigated the failure state. Consider automating this test so that subsequent builds don’t introduce a regression.
Continuous monitoring verification
In addition to monthly GameDays, continuous monitoring verification helps ensure that you’re learning from your GameDays by automating past tests and running them regularly.
First, you’ll need to identify the chaos engineering attacks that should be run in a continuous mode to verify your monitoring. For example, if your first GameDay focused on ensuring that you’re ready to protect against “The service that didn’t get agents” tale from the crypt, then your chaos engineering experiments will be focused on monitoring agents not being installed on services. You could, for instance, use a process-kill attack to stop the monitoring agent repeatedly during the GameDay time frame, then expand the blast radius by running the same chaos engineering experiment on five services. The next step is to inject the failure on a schedule and verify each time that the expected alert fires. By continuously running chaos engineering experiments to verify your monitoring, you move from reactive to proactive testing.
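A single verification run can be sketched as "break it, wait for the alert, put it back." The callbacks below (`inject_failure`, `alert_is_firing`, `restore`) are hypothetical stand-ins for your chaos tooling and your alerting API, and the deadline is an assumption you would set from your own paging expectations.

```python
import time
from typing import Callable


def verify_alert_fires(
    inject_failure: Callable[[], None],
    alert_is_firing: Callable[[], bool],
    restore: Callable[[], None],
    deadline_s: float,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Inject a failure and return True if the alert fires within deadline_s.

    The system is restored whether or not the alert fired.
    """
    try:
        inject_failure()
        start = clock()
        while clock() - start <= deadline_s:
            if alert_is_firing():
                return True
            sleep(poll_s)
        return False
    finally:
        restore()
```

Run on a schedule, a check like this turns each GameDay finding into a regression test: a False result means the fire alarm has stopped working, and that is worth a page of its own.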
The combination of GameDays—an excellent discovery exercise—and continuous monitoring verification—their perfect complement—is effective because it enables you to be both strategic and tactical on a daily basis. Combined, these methods will empower you to use tests that prevent the tales from the crypt I scared you with earlier. Start where you are and I guarantee you’ll uncover many toe-curling findings. It’s time to face the darkness.