Containers, Kubernetes, and microservices have enabled operations teams to move ever faster, producing more applications, packaging them up, and scaling them to meet users’ needs. But as infrastructure scales, it generates more data, and developers have to work harder to cut through the noise.
The three pillars of observability—metrics, logs, and tracing—are a great place to start. Let’s explore the benefits, challenges, and use cases of each to come away with a stronger understanding of how to make the most of our container data.
Using metrics meaningfully
Metrics allow you to evaluate, at a high level, whether a system is behaving as expected. As such, they’re typically used for alerting. There are hundreds of metrics to describe each container, from memory to network activity to CPU, so to keep your alerts targeted and useful, focus on ones that describe the end user experience: server error rates, latency, saturation, and so on. Metrics that describe causes, rather than symptoms, of a service failure, such as a specific error, high CPU usage, or increased memory consumption, are less helpful for alerting. Knowing CPU usage is at 90 percent, for example, doesn’t tell you anything about how users experience the service; knowing 99.9 percent of requests were successfully processed within 100 milliseconds, on the other hand, tells you your service is reliable and fast enough to provide the desired user experience.
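To make the symptom-based view concrete, here is a minimal sketch of how a user-facing metric like "requests served successfully within 100 milliseconds" could be computed. The `Request` record and the `good_request_ratio` helper are hypothetical names chosen for illustration, and treating any status below 500 as a success is an assumption, not a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int          # HTTP status code returned to the client
    duration_ms: float   # time taken to serve the request


def good_request_ratio(requests, threshold_ms=100.0):
    """Fraction of requests that succeeded and finished within threshold_ms."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the target
    good = sum(
        1 for r in requests
        if r.status < 500 and r.duration_ms <= threshold_ms
    )
    return good / len(requests)


sample = [
    Request(200, 42.0),
    Request(200, 87.5),
    Request(500, 12.0),   # fast, but a server error
    Request(200, 340.0),  # successful, but too slow
]
print(good_request_ratio(sample))  # 0.5
```

Note that nothing in this calculation looks at CPU or memory: it only asks whether each request gave the user a good outcome.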
Engineering organizations typically formalize their reliability goals as service-level objectives (SLOs) and assign metrics to them. This is an effective means of focusing the information generated across an entire infrastructure, providing concise, high-level insights as your organization deploys a growing number of containers. SLOs remind us that software, just like the humans who create it, is imperfect. Defining an SLO (for example, requiring that 99.9 percent of all requests over a 30-day window be processed successfully) helps establish how imperfect a system can be while still providing a positive user experience. Metrics such as queue length, number of container crashes, and dropped network packets can also be useful for debugging, but they shine brightest when leveraged to inform user-centric SLOs.
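How imperfect a system is allowed to be can be worked out with simple arithmetic, often called an error budget. The sketch below assumes a request-based SLO measured over a single window; the `error_budget` function and the traffic numbers are made up for illustration. The target is passed as a string so the fraction is represented exactly rather than as a rounded float.

```python
from fractions import Fraction


def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures) for a request-based SLO.

    slo_target is a decimal string such as "0.999" (99.9 percent).
    """
    target = Fraction(slo_target)                  # "0.999" -> 999/1000, exact
    allowed = int((1 - target) * total_requests)   # whole failures permitted
    return allowed, allowed - failed_requests


# Hypothetical 30-day window: one million requests, 250 of which failed.
allowed, remaining = error_budget("0.999", 1_000_000, 250)
print(allowed, remaining)  # 1000 750
```

A negative remaining budget would mean the objective has been missed for the window.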
Logging with intention
Logs are a strong source of specific, detailed information, like the precise error that triggered an alert, which users performed high-latency requests, and how often and why requests were retried. They also make it easy to add metadata to each container, such as the container name and the node it’s on, as well as Kubernetes-specific metadata like pod and namespace. Since Kubernetes’ metadata terminology is consistent across every cluster, this can save developers valuable time finding the right logs when troubleshooting.
Because they reflect such fine-grained details, logs tend to be most useful for determining the causes of errors, rather than being applied to higher-level concepts like SLOs. They can tell you a database connection error occurred, for example, but you’d be hard pressed to use them to calculate the error rate across the infrastructure.
To make effective use of logs and metrics’ respective strengths, apply consistent metadata to both. This allows you to jump from high-level metrics to logs that explain the metrics in greater detail, while retaining context about the process being investigated. With a container orchestrator like Kubernetes, for example, you can hop directly from a container’s high server error rate to its logs through its namespace, pod, and name tags. Those logs might reveal that a database connection limit is causing new connections to be refused, resulting in the error reported through the alerting system.
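The jump from a metric to its logs is essentially a lookup on shared metadata. Here is a minimal sketch, assuming both the metric and the log records carry `namespace`, `pod`, and `container` fields; the metric name, the record contents, and the `logs_for_metric` helper are all hypothetical, and a real log backend would run this as a query rather than an in-memory filter.

```python
# Metadata keys assumed to be applied consistently to metrics and logs.
CORRELATION_KEYS = ("namespace", "pod", "container")


def logs_for_metric(metric_labels, log_records):
    """Return the log records whose metadata matches the metric's labels."""
    wanted = {key: metric_labels[key] for key in CORRELATION_KEYS}
    return [
        record for record in log_records
        if all(record.get(key) == value for key, value in wanted.items())
    ]


alerting_metric = {
    "name": "http_server_error_rate",
    "namespace": "shop",
    "pod": "checkout-0",
    "container": "checkout",
}
logs = [
    {"namespace": "shop", "pod": "checkout-0", "container": "checkout",
     "message": "database refused new connection: connection limit reached"},
    {"namespace": "shop", "pod": "cart-0", "container": "cart",
     "message": "request served"},
]

for record in logs_for_metric(alerting_metric, logs):
    print(record["message"])
```

If the two data sources used different names for the same pod or container, this lookup would silently return nothing, which is why consistency matters more than the particular keys chosen.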
Developing workflows to identify related data in this manner is crucial to understanding an issue quickly and with the needed context. The ecosystem has given us a common terminology with which to monitor containers, so let’s make use of it!
Distributed tracing allows you to follow requests as they travel through the microservice infrastructure. It’s particularly useful for understanding how a system parallelizes execution, the amount of time a request spends in different parts of a system and across network boundaries, and the decision tree the request has gone through. Tracing allows us to access the complex interactions between services in a single visual interface focused on a particular request, something we can’t do with metrics and that requires a lot of effort and discipline to do with logs.
As with logs, consistently applied metadata is the key to using tracing effectively. Jumping from metrics or logs to traces presents a challenge, though: While metrics and logs describe individual containers and processes, traces represent an entire request as it traverses systems. As a result, metadata from a log or metric doesn’t map one-to-one to traces. To move between logs and traces, you’ll want to include the trace ID in each log line. To move between metrics and traces, you’ll want to use exemplars, or metadata attached to a metric that can contain arbitrary references. Tagging latency buckets with trace IDs using exemplars allows you to open an example trace of a certain latency directly from a metric.
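Both links can be sketched in a few lines. The example below is a simplified illustration rather than any particular metrics library's API: `log_with_trace` prefixes a log line with the trace ID, and `LatencyHistogram` is a hypothetical histogram that remembers the most recent trace ID seen in each latency bucket as that bucket's exemplar.

```python
import bisect


def log_with_trace(trace_id, message):
    """Format a log line that can be matched back to its trace."""
    return f"trace_id={trace_id} {message}"


class LatencyHistogram:
    """Latency buckets that each keep one example trace ID (an exemplar)."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket for observations above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.exemplars = [None] * (len(self.upper_bounds) + 1)

    def _bucket(self, seconds):
        # Index of the first bucket whose upper bound is >= seconds.
        return bisect.bisect_left(self.upper_bounds, seconds)

    def observe(self, seconds, trace_id):
        index = self._bucket(seconds)
        self.counts[index] += 1
        self.exemplars[index] = trace_id  # keep the most recent example

    def exemplar_for(self, seconds):
        """Trace ID recorded for the bucket that `seconds` falls into."""
        return self.exemplars[self._bucket(seconds)]


hist = LatencyHistogram([0.1, 0.5, 2.0])
hist.observe(0.05, "trace-a")
hist.observe(3.2, "trace-b")

print(hist.exemplar_for(3.0))                     # trace-b
print(log_with_trace("trace-b", "cache miss"))    # trace_id=trace-b cache miss
```

Starting from the slowest bucket, the exemplar gives you a trace ID to open, and the same ID then finds the matching log lines.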
Because tracing is most useful for troubleshooting situations like high latency or understanding the code flow that resulted in an error, a helpful way to find the signal in the noise is to selectively store data—for example, only when a request lasts longer than two seconds or a database error has occurred. Such sampling strategies are useful for keeping the amount of data handled by the database low, which in turn keeps databases performant and costs down.
Most importantly, sampling strategies allow you to only retain traces worth examining. For example, you can configure your tooling to sample the 99th percentile of total request latency; then, when investigating high latency, you can track down a concrete trace through an exemplar obtained from a metric. From there, using the metadata from the trace, you can jump to the process’s logs to discover a cache miss that caused latency to increase. The trade-off with sampling, of course, is that you’ll only retain the data you’ve configured your tooling to collect; if your sampling strategy is configured to only sample high latency, for example, you might miss traces about things like errors that return quickly.
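The trade-off is easy to see when a sampling decision is written out as a predicate over finished traces. This is a sketch under assumptions: the `Trace` record and both policy functions are hypothetical, and the two-second threshold simply mirrors the example given earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    duration_s: float
    has_db_error: bool = False


def keep_slow_or_db_error(trace, threshold_s=2.0):
    """Keep traces that were slow or that hit a database error."""
    return trace.duration_s > threshold_s or trace.has_db_error


def keep_slow_only(trace, threshold_s=2.0):
    """Keep only slow traces; fast failures are dropped."""
    return trace.duration_s > threshold_s


traces = [
    Trace("t1", 0.12),                     # fast and healthy
    Trace("t2", 2.7),                      # slow
    Trace("t3", 0.03, has_db_error=True),  # fails quickly
]

print([t.trace_id for t in traces if keep_slow_or_db_error(t)])  # ['t2', 't3']
print([t.trace_id for t in traces if keep_slow_only(t)])         # ['t2']
```

The latency-only policy never stores `t3`, so an error that returns quickly would leave no trace to inspect.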
Tuning and experimenting with sampling strategies to find the right fit for your containers, combined with a consistent approach to container metadata, allows you to keep tracing’s costs manageable at scale—and combines the three pillars of observability into a unified whole that leverages the unique benefits and strategic insights of each.
Going past the pillars
Metrics, logs, and tracing are a great starting point for organizations new to the space, but their coverage isn’t exhaustive. An emerging observability practice called continuous profiling, for example, allows for more detailed performance analysis over time, using low-overhead sampling profiling techniques to continuously profile code instead of doing it ad hoc. Developers can use continuous profiling to examine CPU and memory usage over time, down to the line number, and harness the data to investigate memory leaks, improve code performance, and systematically optimize resource usage.
This is just the tip of the iceberg. As observability practices continue to mature, I expect we’ll see more emerging tools and technologies that help us troubleshoot issues, improve efficiency, and build even more reliable systems. Most likely, consistently applied metadata will be central to bringing it all together, painting a rich and detailed portrait of the health and performance of our containers—and of our systems as a whole.