Containers at scale

Engineering leaders at Datadog, Braze, and BetterUp discuss container tools, testing, and monitoring, and how they’ve approached container migrations.
Part of Issue 17, May 2021


Laurent Bernaille

Senior staff engineer, Datadog

2,000+ employees

Chris Rogus

Director of engineering, Braze

700+ employees

Bryan Hickerson

Engineering manager, BetterUp

270+ employees

What container technologies and tools does your organization use?

The main container technologies we use are Kubernetes, containerd, and Cilium. We run dozens of Kubernetes clusters of various sizes—our largest ones contain over 4,000 nodes each—across multiple cloud providers, and we rely on internally developed tooling to manage and orchestrate deployments over multiple clusters.

— Laurent Bernaille, Datadog

We use Kubernetes in AWS and Azure, running Dockerized applications written in Ruby on Rails, Java, Go, and Python. Kubernetes reports metrics to Datadog, logs go to Papertrail, and application errors go to Sentry. We use Sops for secrets configuration and Terraform to define the infrastructure Kubernetes is deployed into across clouds.

— Chris Rogus, Braze

We use Heroku, which employs lightweight containers called dynos, for our web servers, background jobs, and a subset of our ML microservices. Additional ML microservices use Kubeflow. We align our development and test environments with production using Docker. For logging, reporting, and system health alerts, we use SolarWinds Papertrail and Sumo Logic. For client and application error reporting, we use Sentry. Finally, for performance monitoring, we use Scout and Calibre.

— Bryan Hickerson, BetterUp

When did your organization start using containers, and how have they changed your development workflow?

We started migrating Datadog to Kubernetes in early 2018, with a first version of Datadog running fully in Kubernetes in production after about six months. This included both stateless web applications and stateful data services like Cassandra and Kafka. We were migrating from applications running in VMs managed with Chef, so the transition required many changes to our development process. For example, we had to containerize every application and provide a solution to deploy to Kubernetes clusters, which initially relied on Spinnaker and Helm charts. The migration was challenging: The deadline was ambitious, and we were starting from an environment where we had no containers and no tooling to deploy them. But it was also rewarding because it allowed us to have unified packaging and deployment solutions, making it possible to deploy in new cloud providers and regions.

— Laurent Bernaille, Datadog

We started using containers about two years ago as part of our effort to become multi-cloud. This was after almost a year of initial exploration. Containers added some complexity at first, notably in terms of configuration, but as we built out our tooling, some aspects also became easier. For example, configuration used to come from Chef, which required more restrictive permissions for changes. By moving the Chef data bag config into Sops, we enabled simpler self-service changes for developers.

— Chris Rogus, Braze

Prior to 2015, we used a VM-based development environment, then switched to containers due to challenges with native dependencies that were compiled locally and tended to break across upgrades. Switching to containers remedied this—we were able to migrate seamlessly without negatively impacting our development workflow. It also made our development environment more modern and production-like, and less resource-intensive. 

— Bryan Hickerson, BetterUp

Has your organization migrated any legacy applications to containers? What were the challenges, and what did you learn? 

Most of our applications are written in Go, Python, and Java, so running them in containers isn’t that difficult. Of course, the devil is in the details, and we faced several challenges, including managing the memory footprint of the JVM in containers. Most applications assumed they were the only one running on a VM, which brought its own challenges, especially regarding IO operations (disk and network access). Kubernetes is very efficient at sharing CPU time and memory, but it offers less control over how to limit and isolate IO consumption. For one application, we noticed that we needed twice the number of hosts after migrating it to Kubernetes. Once we profiled the application and analyzed the overhead, we optimized the pod configuration and greatly reduced the number of additional hosts we needed.

— Laurent Bernaille, Datadog

We’ve migrated almost all of our legacy applications to containers. Dockerizing the applications was relatively straightforward and made packaging dependencies and deployment easier in most cases. Previously, DevOps managed the EC2 instances that applications were copied onto and run on via Chef. By moving apps into containers, application engineers gained more direct control over which environments applications run in, what tools and libraries are available, and how resources are allocated.

The challenges were primarily around shifting responsibilities for the deployment pipeline from DevOps to the application engineering teams, and around knowledge of debugging applications in Kubernetes as opposed to on an EC2 instance. All of this has significant long-term benefits, however, eliminating the need to go back and forth on changes and more tightly coupling the code with the environment in which it runs.

— Chris Rogus, Braze

We used containers to experiment with ML microservices. We extracted small portions of our primary app and spun up new services quickly and with a better-suited tech stack. This allowed for rapid iteration and experimentation. For example, we were able to seamlessly replace a model trained with Bayesian methods in R with one using neural networks in Python. 

One challenge we faced in spinning services off of a monolith was that the services no longer had direct access to real-time app data. We had to determine what data the microservices would retain access to and learned that the closer a service was to real time, the more it would need access to contextual data. 

— Bryan Hickerson, BetterUp

How do you deploy and monitor your containerized applications? What are your key health metrics?

We rely on Datadog for monitoring. Each application is responsible for configuring its monitoring, but some key metrics are used everywhere: container CPU and memory usage, container status and number of restarts, as well as the health of underlying nodes. We initially used Spinnaker to deploy containerized applications, which provided a strong foundation early on, but we outgrew it as the number of clusters increased and the workflows became more complex. We’re currently working on an internal solution for multi-cluster deployments that leverages Helm and Cloud Native Application Bundles, backed by Temporal.

— Laurent Bernaille, Datadog

We look primarily at memory and CPU, standard Kubernetes monitoring, as well as application-specific metrics like internal queue sizes and error rates. Apps are deployed with Helm, using an in-house tool that provides deployment configurations to the Helm CLI via Jenkins upon changes to the config (YAML in a GitHub repo). 

— Chris Rogus, Braze
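As a rough illustration of the kind of tool described above, here is a minimal Python sketch that turns deployment config entries into Helm CLI invocations when their config file changes. The `Deployment` fields, the file paths, and the idea of keying on changed values files are all assumptions made for illustration; the interview does not describe how the in-house tool is actually structured. The Helm flags used (`upgrade --install`, `--namespace`, `--kube-context`, `--values`) are standard Helm 3 options.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Deployment:
    """Hypothetical shape of one entry in a deployment config repo."""
    release: str
    chart: str
    namespace: str
    kube_context: str
    values_file: str


def helm_upgrade_command(d: Deployment) -> List[str]:
    """Build (but do not run) the Helm CLI invocation for one deployment."""
    return [
        "helm", "upgrade", "--install", d.release, d.chart,
        "--namespace", d.namespace,
        "--kube-context", d.kube_context,
        "--values", d.values_file,
    ]


def commands_for_change(
    deployments: Iterable[Deployment], changed_files: Iterable[str]
) -> List[List[str]]:
    """Return Helm commands only for deployments whose values file changed."""
    changed = set(changed_files)
    return [helm_upgrade_command(d) for d in deployments if d.values_file in changed]
```

A CI job (Jenkins, in the setup described) could pass the list of files changed in a commit to `commands_for_change` and run each resulting command, so that only the affected releases are redeployed.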

We use Heroku to continuously deploy our application when a build passes in our main branch. We use Heroku plus a logging service—Pingdom and New Relic, in conjunction with PagerDuty for alerts—that allows us to investigate issues in production systems and alert our team if issues are detected. We also use synthetic and real user monitoring to detect catastrophic errors and performance issues. As a team, we use KPIs to track trends in our infrastructure. One key health metric is server uptime, which was 99.999 percent in 2020.

— Bryan Hickerson, BetterUp
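To put the 99.999 percent uptime figure in perspective, the short Python sketch below converts an availability percentage into a downtime budget. The function name is hypothetical, and the calculation assumes uptime is measured over the full calendar year (2020 was a leap year, so 366 days); the interview does not say how BetterUp measures it.

```python
from datetime import timedelta


def downtime_budget(availability_percent: float, days_in_period: int) -> timedelta:
    """Return the total downtime allowed over a period at a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(days=days_in_period) * unavailable_fraction


# Five nines over a 366-day year allows roughly 5.3 minutes of downtime.
budget_2020 = downtime_budget(99.999, days_in_period=366)
print(f"{budget_2020.total_seconds() / 60:.1f} minutes")
```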

How does your organization approach container testing? How do you leverage automation?

We don’t perform systematic testing of containers. Instead, we test the application in CI and validate new container versions in staging and with canaries. We also test containers ad hoc when we suspect containerization is affecting an application, specifically when we see performance regressions that can’t be explained by changes in the codebase.

— Laurent Bernaille, Datadog

Many of our applications are developed and tested locally by running them via Docker Compose. We also run end-to-end tests multiple times each day across development and staging environments running containerized application deployments. Our CI system, Buildkite, also runs tests inside Docker, triggered automatically by changes to the application code.

— Chris Rogus, Braze

Our test containers are configured to match our production environment. We don’t directly test the containers themselves, but our continuous testing processes ensure application behavior is consistent across branches.

— Bryan Hickerson, BetterUp

How does your organization keep pace with shifts in the container ecosystem? How do you decide when to adopt a new technology or tool?

Since we use Kubernetes at scale and face challenges the ecosystem is only starting to address, we tend to test new technologies pretty early on and adopt them when tests are successful. For example, we standardized on containerd as soon as it supported the Container Runtime Interface (CRI), and, for scalability reasons, we used kube-proxy in IPVS mode while it was still in beta. More recently, we’ve standardized on Cilium for pod networking, service load balancing, and network policies.

— Laurent Bernaille, Datadog

Our engineers are attentive to changes throughout the industry. They’re often beating the drum to make changes in order to try out new approaches, including doing proof of concept demonstrations at our internal hack days to explore and gauge interest from others. They follow AWS announcements, Kubernetes announcements, and any number of tech news sources to learn about new options. They seek out solutions when they encounter problems, and they imagine what “better” might look like. For example, we’ve done a lot of research into cost saving through automated instance provisioning with spot instances.

— Chris Rogus, Braze

We rely on engineers to surface opportunities for improving how we use containers, and we weigh the cost against the potential value or need. For example, recently our frontend and full-stack engineers were encountering file system performance issues with Docker for Mac. One of our engineers investigated techniques for improving IO with Docker and experimented with Mutagen, NFS, and other techniques for file sharing between the native system and Docker. Eventually we adopted Mutagen across the team, which significantly improved the developer experience. Build time for frontend containers is no longer a drag on individual developer productivity.

— Bryan Hickerson, BetterUp

What has your organization found most surprising about using containers?

We’ve faced quite a few surprising challenges, from control plane scalability issues to low-level runtime or networking problems. Overall, though, the big success story for Datadog in adopting containers is that it’s allowed us to scale and deploy across multiple cloud providers using a common abstraction.

— Laurent Bernaille, Datadog

Containers built for local development need additional debugging tools that are undesirable in production. Debugging in production is much more difficult than debugging locally, especially with granular access control lists on servers hosting containers. Accurately reproducing a service-oriented architecture locally in containers can overwhelm laptops’ CPU and memory, which leads to shortcuts that sacrifice some fidelity, like not running “real” Kubernetes clusters or the same configurations. CI builds containers differently from how they’re built locally and can easily include content that’s not present locally, which can be difficult to recognize or debug.

— Chris Rogus, Braze

Containers enabled us to train new ML models on one cloud provider and easily migrate to another when we were ready to integrate them with our primary app. We’ve also been surprised at how few issues we’ve experienced related to the containers themselves. Any issues are typically at a higher level of abstraction than the container level; we’ve found bugs in how we deploy the application, for instance, but they weren’t specific to our use of containers. 

— Bryan Hickerson, BetterUp
