September 2019. For months, we’d been seeing a steep increase in the number of projects being built on Glitch, and more and more traffic headed to them. We struggled to keep up. What had started as performance edge cases had become commonplace. Many unknown unknowns had become known unknowns.
This reconstructed diary chronicles our efforts to better understand our production systems and ultimately make them more reliable, a journey that led us to adopt distributed tracing and use an observability vendor. Hopefully reading about our experience can offer some inspiration and insight for your own.
Day 1: We have a problem
On Glitch, you can build everything from simple static sites to full-stack Node.js projects. When full-stack projects aren’t worked on or visited for a few minutes, we “stop” them to save resources and “start” them again on demand. How long a project takes to start depends on its complexity, but recently we’ve noticed something strange: The same project might start in a couple of seconds one day and a minute the next. To make matters worse, an increasing number of users are reporting these fluctuating start times.
We can’t find any obvious reason or pattern that would explain this. It’s time to start debugging.
Day 2: Digging in
We have metrics that might help us debug this issue. I’m quite fond of one in particular—we go way back—that measures how long it takes to start a project. In the past, we’ve used it to identify full-system performance regressions.
The problem is we only have access to aggregated values such as the mean, median, and 95th percentile. Since the slowdown isn’t affecting every project, these aren’t very useful. The slow-to-start projects haven’t made a dent in the aggregates, and we can’t use the metric to narrow down which projects are affected.
We might be able to pinpoint the cause if we can slice the metric by a few more dimensions: project ID, release version of the service, instance ID, and so on. We’ll add a few more labels and debug further tomorrow.
Day 3: You’re charging us how much?
Oh no. We’ve blown through the number of time series our metrics vendor allows per month. Unless we get the situation under control, they’ll charge us an additional $200,000—monthly.
Project ID, it turns out, is a high-cardinality label—that is, a label for which there’s a very high range of possible values. For each unique combination of labels, a new time series is created. The total possible number of time series can be calculated as
# of projects * # of hosts * # of service release tags. With millions of projects, thousands of hosts, and several releases each day, we’re producing millions of time series.
Without support for high-cardinality labels, we’re greatly restricted in the kinds of questions we can ask of our systems. Metrics won’t be the golden ticket to debugging this narrow performance regression. Better remove those labels again…
Day 4: Let’s check the logs
Luckily, we have structured logs. Our services emit blobs of JSON objects that are processed by our log pipeline. There’s no limit to the number or cardinality of labels we can attach to them.
We’ve instrumented our services such that a unique request ID is generated for each request and propagated through our systems; this allows us to easily view all logs for a single request. We’ve also instrumented our most important code paths with log lines that trace the timing of important operations, resulting in log lines that read
<operation name> started and
<operation name> ended <duration>. They allow us to look at individual requests and identify where latencies are introduced. We’ve used them to debug some pretty gnarly performance regressions. It’s not the easiest way to do it, but it’s better than nothing.
There’s one problem, though: scalability. Currently, 65 percent of our log lines are for timing operations. As traffic increases, this is becoming prohibitively expensive. There must be a better way.
Day 23: Observability to the rescue
The term “observability” has been popping up on my Twitter feed a lot recently—and it turns out there’s a whole community of people who’ve been through this before. Many have invested their talents, time, and resources into figuring out how best to build systems that are easy to inspect when operational hiccups, performance regressions, and other unexpected things happen.
Observability is all about being able to ask questions of your systems and get answers based on the telemetry they produce. We’re definitely in need of some answers.
Day 25: Telemetry talks
After countless hours of research, we believe distributed tracing is a promising solution. It addresses both of the issues we experienced with metrics and logs, allowing for high-cardinality labels and supporting sampling, which keeps costs down.
A distributed trace tracks the progression of a single request as it’s handled by your services. A trace is structured as a tree of spans, which are named, timed operations to which you can attach an arbitrary number of labels. If you squint, you could argue that our current logging infrastructure is a bespoke implementation of distributed traces. However, a proper distributed tracing solution allows for sampling, and the traces contain a richer structure, which allows for more sophisticated visualizations and analysis.
So, how do we go about getting our services to emit trace data?
Day 30: Tuning our instrumentation
Instrumenting your services to emit distributed traces is more conceptually advanced than logs or metrics, since you have to propagate contexts between services and think about sampling. In our case, though, this isn’t too heavy a lift. Context propagation isn’t a new concept to us—we already propagate request IDs for our logs—and the ability to perform sampling is one of the reasons we wanted to look into distributed traces. It also helps that we only have a handful of microservices.
Since instrumenting services is time-consuming, we’ve decided to use the vendor-agnostic OpenCensus project (later superseded by OpenTelemetry). That way, we can always decide to change vendors without having to re-instrument everything. We use the Node SDK to instrument our services and have them send traces to the OpenCensus collector. The SDK takes care of context propagating and sampling, while the OpenCensus collector ingests the trace data and sends it off to a vendor or tool, in our case Honeycomb. We decided to use a vendor rather than self-host a tool to save our small team of engineers time.
Day 32: Catching the culprit
Now that we have access to highly detailed traces and sophisticated tooling to query them, we’ve been able to identify the cause of the slow-to-start projects: caching. We visualized a heat map of the duration of all traces for starting projects, and inspected the trace-view of one of the slowest. This showed us that latencies were introduced in the
project.initialization span. We then visualized a heat map of the duration of all
project.initialization spans and started grouping by various labels until we found a pattern. Spans labeled
project.reuse = false were consistently slower.
The tool couldn’t tell us why—for that we’d have to rely on our expertise and knowledge of our systems, or pull out the source code and start digging. In this case, we knew that to speed up project start time, we try to place projects on the host they last ran on to reduce the amount of initialization we have to do. This shortens start times and reduces strain on the hosts. However, now that we’re running many more projects, the previous host is far less likely to have available capacity. Projects increasingly have to perform cold starts, which makes it more likely that other projects will start up simultaneously, creating a vicious circle. We’ll have to revisit that scheduling algorithm.
Day 176: Welcome side effects
December 2019. We’ve had all of our services instrumented for about six months now—it’s time to take stock of our observability journey.
We’re still iterating on our sampling configuration. We use head-based sampling and only retain about 3 percent of traces. This has allowed us to debug most issues, but the low sampling rate can be problematic if we’re trying to investigate something less common, as it’s unlikely to get sampled. There are more sophisticated sampling techniques we could adopt to alleviate this, but for now we still rely on logs in these cases.
There’s always a learning curve when introducing a new tool, and adoption has to be nurtured. We had an initial training session with Honeycomb, performed regular show-and-tell sessions internally to share tips and tricks, and highlighted interesting queries during incident retrospectives. Most of our engineers are now comfortable with Honeycomb and use it regularly.
So, was it worth the investment in time and resources? I believe so. The observability practices we’ve adopted to help debug slow project starts have had positive side effects. With richer telemetry and more sophisticated tooling, we’re quicker to identify problems and disprove hypotheses. We’ve greatly reduced the duration of incidents and rarely have to make code changes to get answers to our questions. This is where observability really shines: It enhances your ability to ask questions of your systems, including ones you hadn’t originally thought to ask.
But perhaps the most valuable thing we’ve gotten out of this journey is a shift in perspective. Now that we have deeper visibility into our production systems, we’re much more confident carrying out experiments. We’ve started releasing smaller changes, guarded by feature flags, and observing how our systems behave. This has reduced the blast radius of code changes: We can shut down experiments before they develop into full-fledged incidents. We think these practices, and the mindset shift they’ve prompted, will benefit our systems and our users—and we hope our learnings will benefit you, too.