Testing at scale – Increment: Testing

Rob Zuber, CTO, CircleCI (250 employees)

Greg Bell, VP of software development, and Claudiu Coman, senior manager of software development, Hootsuite (1,000+ employees)

Scott Triglia, ads group technical lead, Yelp (5,500 employees)

How have your test suites changed over the course of your organization’s history? How long do they take to run?

CircleCI has been in the business of continuous integration (CI) and continuous deployment (CD) since 2011. The biggest change in our test suite has mirrored an industry trend: the transition from monolith to microservices. Our test suites have moved with that change and now run on a per-service basis.

Our longest test suite for our monolith takes seven minutes and 19 seconds to run, using 10 parallel machines for greater speed. That test suite is part of a workflow that takes 10 minutes and six seconds in total, including a deploy step.

Our shortest test suite takes 37 seconds, with no parallelism, on our API gateway service.

— Rob Zuber, CircleCI
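To make the parallelism mentioned above concrete, here is a minimal sketch of one common way to divide a suite across machines: greedy balancing by historical run time. It is an illustration only, not a description of CircleCI's implementation; the function name, test names, and timings are all hypothetical.

```python
import heapq
from typing import Dict, List


def split_by_timing(timings: Dict[str, float], workers: int) -> List[List[str]]:
    """Assign each test to the currently least-loaded worker, longest tests first."""
    # Heap of (accumulated seconds, worker index); the smallest load pops first.
    heap = [(0.0, index) for index in range(workers)]
    buckets: List[List[str]] = [[] for _ in range(workers)]
    longest_first = sorted(timings.items(), key=lambda item: item[1], reverse=True)
    for name, seconds in longest_first:
        load, index = heapq.heappop(heap)
        buckets[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    return buckets


# Hypothetical timings (in seconds) from a previous run.
previous_run = {"test_a": 5.0, "test_b": 4.0, "test_c": 3.0, "test_d": 2.0, "test_e": 1.0}
buckets = split_by_timing(previous_run, workers=2)
loads = [sum(previous_run[name] for name in bucket) for bucket in buckets]
```

With balanced buckets, the wall-clock time of the suite approaches the load of the busiest worker rather than the sum of all tests.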

Five years ago, while working on our Hootsuite Insights product, we had a small collection of automated tests, but they weren’t as reliable as we wanted. Soon after, as we began working on our Hootsuite Analytics product, we thought it would be a good opportunity to rebuild our test suites from the ground up. We had learned a lot from our early experience using automated tests, and we put those lessons into practice with this new product.

As the team and product matured, we started adding more tests and split them into categories. We then added custom tooling that allows us to skip tests that are failing due to flakiness or external dependencies. We also keep track of quarantined tests and try to fix them ASAP.

We transitioned to continuous integration four years ago, which naturally relies on automation. Our CI pipeline runs several test jobs in parallel, with the longest job taking approximately eight minutes.

— Greg Bell and Claudiu Coman, Hootsuite
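The quarantine idea described above (skipping tests known to be flaky while still tracking them) can be sketched in a few lines. This is an assumption-laden illustration using Python's standard unittest module, not Hootsuite's custom tooling; the `QUARANTINED` registry, the decorator name, and the test names are hypothetical.

```python
import unittest

# Hypothetical quarantine registry: test name -> reason it was quarantined.
# In practice this might live in a file or a tracking system.
QUARANTINED = {
    "test_export_to_third_party": "flaky: depends on an external API",
}


def quarantine_aware(test_func):
    """Skip the test if it is quarantined; otherwise return it unchanged."""
    reason = QUARANTINED.get(test_func.__name__)
    if reason is None:
        return test_func
    return unittest.skip(f"quarantined ({reason})")(test_func)


class ReportTests(unittest.TestCase):
    @quarantine_aware
    def test_totals_add_up(self):
        self.assertEqual(sum([1, 2, 3]), 6)

    @quarantine_aware
    def test_export_to_third_party(self):
        # Would fail whenever the external dependency is unavailable.
        raise RuntimeError("external dependency unavailable")


def run_suite() -> unittest.TestResult:
    """Run the example tests and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReportTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Because skipped tests appear in `result.skipped` along with their reasons, the quarantine list stays visible in every run instead of silently disappearing.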

In the old monolith-only days, Yelp’s test suites tended toward a 50/50 mix of unit tests and integration tests. They ran in one process, which often enabled integration testing by default. Heavy use of database-level sandboxing (with factories for setup and teardown) prevented test pollution, and there weren’t enough external services to introduce flakiness issues. We also added UI-level acceptance testing with Selenium once we had discovered a number of critical frontend bugs after the fact. Tests took roughly an hour to run, with some runs taking up to four hours, but that dropped to an average of 40 minutes, with significantly reduced variance, once we introduced cluster auto-scaling and better queue management.

As the migration to services took hold, integration testing by default and running tests off a single process quickly became untenable. We briefly made do with different test runs hitting shared “global” services, but this led to test pollution. To avoid that, individual use cases patched in fake service implementations ad hoc (e.g., a task worker that actually runs inline when you submit a new task). Eventually, complex call graphs demanded a move to more structured fakes. We’ve had the most success with VCR approaches (named after the device that plays VHS tapes, and rerecorded via docker-compose), which have kept test runs under 20 minutes and free of flakiness.

— Scott Triglia, Yelp
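For readers unfamiliar with the VCR approach mentioned above, the core idea is to record a real service's responses once and replay them in later test runs. The sketch below illustrates that idea with only the standard library; it is not Yelp's tooling, and `RecordReplayClient`, `real_user_service`, and the paths are hypothetical names.

```python
from typing import Callable, Dict


class RecordReplayClient:
    """In record mode, call the real service and store each response.
    In replay mode, serve stored responses and never touch the service."""

    def __init__(self, real_call: Callable[[str], dict],
                 cassette: Dict[str, dict], record: bool) -> None:
        self.real_call = real_call
        self.cassette = cassette
        self.record = record

    def get(self, path: str) -> dict:
        if self.record:
            response = self.real_call(path)
            self.cassette[path] = response  # overwrite on rerecord
            return response
        if path not in self.cassette:
            raise LookupError(f"no recorded response for {path}")
        return self.cassette[path]


def real_user_service(path: str) -> dict:
    # Stand-in for a network call to a real service (hypothetical).
    return {"path": path, "status": "ok"}


def unavailable_service(path: str) -> dict:
    raise ConnectionError("the real service must not be called in replay mode")


# Recording pass: talks to the "real" service and fills the cassette.
cassette: Dict[str, dict] = {}
recorded = RecordReplayClient(real_user_service, cassette, record=True).get("/users/42")

# Replay pass: the service is unreachable, yet the test still gets its data.
replayer = RecordReplayClient(unavailable_service, cassette, record=False)
replayed = replayer.get("/users/42")
```

Replayed runs are fast and deterministic; the trade-off is that recordings go stale and need to be refreshed when the real service changes.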

How heavily do developers rely on your test suites to validate their code? Do green tests mean “good to go,” or are there additional manual assurance processes?

Outside of production-level validation, we rely on automated tests. For us, a merge to master initiates a deployment to production.

We’ll do branch-level canaries in scenarios where we believe it’s difficult to validate without production scale or the oddities of production data. For example, those might be scenarios where there has been a structural change, where we’ve significantly shifted our testing, or where we believe there may be a performance impact introduced by the change. With the sheer job volume running through CircleCI and the huge variety in jobs from customers, we want to be sure we’re capturing every edge case. We never want to be the reason customers’ builds are breaking.

While we don’t require a canary on every deploy today, that’s certainly one of our goals for the future. We also do rolling deploys that will halt on significant errors.

— Rob Zuber, CircleCI
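A rolling deploy that halts on significant errors, as mentioned above, can be sketched as a loop that updates one instance at a time and stops as soon as an updated instance looks unhealthy. This is a simplified illustration, not CircleCI's deploy system; the instance names, the `error_rate` callback, and the 5 percent threshold are assumptions.

```python
from typing import Callable, List, Tuple


def rolling_deploy(instances: List[str],
                   deploy: Callable[[str], None],
                   error_rate: Callable[[str], float],
                   max_error_rate: float = 0.05) -> Tuple[List[str], bool]:
    """Update instances one by one; halt once an updated instance exceeds
    the error threshold. Returns (instances updated, whether we halted)."""
    updated: List[str] = []
    for instance in instances:
        deploy(instance)
        updated.append(instance)
        if error_rate(instance) > max_error_rate:
            return updated, True  # leave the remaining instances untouched
    return updated, False


# Hypothetical post-deploy error rates observed per instance.
observed = {"web-1": 0.01, "web-2": 0.20, "web-3": 0.00}
deploy_log: List[str] = []
updated, halted = rolling_deploy(
    list(observed),
    deploy=deploy_log.append,
    error_rate=lambda name: observed[name],
)
```

Halting limits the blast radius: in this example only two of the three instances ever receive the faulty build.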

Test suites and test coverage are crucial to our development practices. We always like to take a step back to see what’s repetitive and could be further automated, or even what’s redundant and needs to be removed. We’re constantly refining our practices and trying to get the most from our tests, especially given the importance of automated testing in a CI pipeline.

Our developers run tests on their local machines before a PR is created. The tests run again before the code is pushed to staging. A blocking set of live tests runs when deploying the web servers: code cannot be deployed unless the test suite passes, which creates a very high incentive to keep our tests up to date.

Even so, we currently perform manual QA for each Analytics feature as a whole. This is subject to change as our test suite grows and coverage becomes more reliable.

In the future, we’d like to focus on strengthening our CI pipeline, including implementing more postproduction checks that can roll back faulty code. We’ve already identified the remaining use cases our manual QA covers on top of the automated tests, and we’re looking to automate those soon.

— Greg Bell and Claudiu Coman, Hootsuite

Yelp’s trust in tests is divided into roughly two camps: a monolith, where we’re more prescriptive about manual testing, and services, where each team is free to determine its own policy.

For our monolith, each push pauses on staging and production. Then, the developer responsible for the push manually verifies their branches or confirms that they don’t need to. (We allow developers to pre-label their branches as “no-verify” to skip this process if tests are sufficient.) Though we had initial misgivings about the safety of this approach, it’s been incredibly safe in practice and has saved a tremendous volume of developer time.

For services, policy is left up to the service owner. Some services trust tests and enforce little other production verification. Often this trust is backed by strong error, latency, and behavior-related SLOs monitored programmatically in production. For sufficiently complex services, some teams still opt into manual verification in production.

— Scott Triglia, Yelp

What kinds of automated tests do your teams tend to write and rely on?

We do unit testing and integration testing. One thing that’s a bit unique is that we offer the same core software functionality in two delivery models for customers: We have a cloud-hosted SaaS version and also a self-hosted version for teams who need their systems behind their firewall.

For self-hosted CircleCI, we do end-to-end testing for our deploy models because there are many possible configurations and variables we need to account for. We can’t use canaries when we’re delivering software behind our customers’ firewalls on their own infrastructure, so that end-to-end validation is really important. It’s an interesting paradox to be a CI/CD company building and shipping packaged software.

— Rob Zuber, CircleCI

We have several types of automations:

Unit tests are the primary focus of our automation work.
Component tests verify interactions between different units of code.
Contract tests check external providers (third-party APIs as well as services developed by other teams at Hootsuite).
API tests simulate user interactions by making direct calls to the webnode. They’re faster than UI tests and cover more complex use cases that span large areas of our infrastructure, mostly focused on testing connectivity with databases and interaction between services in the mesh.
Acceptance tests are a small suite of tests that are focused on validating critical end-to-end flows.
Stress tests simulate several users hitting the servers concurrently. We then measure the servers’ performance and look for changes that might have broken our requirements.
Static analysis on our code detects possible security vulnerabilities.

We use Asgard to do blue-green deploys on our webnode. We recently developed a tool that does post-deploy monitoring of services—for now it just alerts, but in the future it will roll back changes automatically.

All of our services have health checks, and service instances that fail the health checks get pulled out.

— Greg Bell and Claudiu Coman, Hootsuite
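Of the test types listed above, contract tests are perhaps the least familiar. One simple form checks that a provider's response still has the fields and types the consumer depends on. The sketch below assumes a hypothetical `USER_CONTRACT` and hand-written payloads; it does not reflect Hootsuite's actual contracts or tooling.

```python
from typing import Dict, List

# Hypothetical contract: the fields a consumer relies on, with expected types.
USER_CONTRACT: Dict[str, type] = {"id": int, "name": str, "followers": int}


def contract_violations(payload: dict, contract: Dict[str, type]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the
    payload satisfies the contract. Extra fields are allowed."""
    problems: List[str] = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


# A payload that honors the contract (the extra field is tolerated).
good = {"id": 7, "name": "owly", "followers": 120, "plan": "pro"}
# A payload from a provider that changed a type and dropped a field.
drifted = {"id": "7", "name": "owly"}

good_problems = contract_violations(good, USER_CONTRACT)
drifted_problems = contract_violations(drifted, USER_CONTRACT)
```

In a real suite the payload would come from the provider (or a recording of it), so that a breaking change on the provider's side fails the consumer's test before it reaches production.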

Automated testing strategy varies a lot depending on the complexity of the service in question. Our monolith still depends significantly on database-level setup and teardown routines to allow a majority of tests to do full integration testing all the way down the stack. By contrast, in services, unit tests often represent a larger proportion of the overall test suite.

In production, we make broad use of very simple health checking, and we’re rolling out more universal error and latency SLOs. Some services are doing small, local experimentation with automatic rollbacks on violations and synthetic end-to-end health checking of large systems, but we’ve done no broad rollout of either of these approaches.

— Scott Triglia, Yelp
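The error and latency SLOs described above lend themselves to a programmatic check, which is also the decision at the heart of an automatic rollback. The following is a minimal sketch under stated assumptions: the `Slo` thresholds, the sample data, and the nearest-rank percentile are illustrative choices, not Yelp's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slo:
    max_error_rate: float       # fraction of requests allowed to fail
    max_p99_latency_ms: float   # 99th-percentile latency budget


def p99(latencies_ms: List[float]) -> float:
    """Nearest-rank 99th percentile, using integer arithmetic for the rank."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def violates_slo(statuses: List[int], latencies_ms: List[float], slo: Slo) -> bool:
    """True if the observed window breaks either the error or latency SLO."""
    if not statuses or not latencies_ms:
        return False  # no traffic observed, nothing to judge
    errors = sum(1 for status in statuses if status >= 500)
    error_rate = errors / len(statuses)
    return error_rate > slo.max_error_rate or p99(latencies_ms) > slo.max_p99_latency_ms


slo = Slo(max_error_rate=0.01, max_p99_latency_ms=250.0)

# Hypothetical window after a bad deploy: 3 percent of requests fail.
bad_window = violates_slo([200] * 97 + [500] * 3, [20.0] * 100, slo)
# Hypothetical healthy window: no errors, latency well within budget.
good_window = violates_slo([200] * 100, [20.0] * 100, slo)
```

A rollback mechanism would act on a `True` result; an alert-only setup would simply page someone.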

How do you build and support your testing infrastructure on both a human and a hardware level?

We’re definitely a bit of an outlier here!

Since we’re a CI/CD company, we have 250 people devoted to building and maintaining our testing infrastructure and systems, and 100 percent of our hardware is devoted to automated testing. We have approximately 12,000 customer tasks executing simultaneously and roughly 1,100 clients. We are great friends with the folks at AWS and GCP.

For us, maintaining testing infrastructure is our core competency. We deliver that to customers so they can focus on their unique value proposition and delivering that value to their users. Engineering time is incredibly valuable, and our goal is to help teams focus on building their next big thing, not building or maintaining CI/CD systems.

— Rob Zuber, CircleCI

Hootsuite has a centralized 21-person production operations and delivery team—soon growing to 33 people! Within that, there’s a build-test-deploy team of five. This smaller team’s mandate is to unify and modernize our disparate build, test, and deploy tools and practices in order to present a single, easy-to-consume build-test-deploy platform to all our development teams. They also own our code and artifact repositories, our homegrown test engineering tools, and our Jenkins and Kubernetes-based build servers with templated deployment pipelines.

The Hootsuite Analytics team currently maintains its own pipeline separate from the centralized build-test-deploy team. The Analytics team doesn’t have dedicated people to work on its testing infrastructure, so the team dedicates 25 percent of its capacity to test maintenance, to reducing technical debt, and to innovation. This explicit dedication of time has contributed to a substantial increase in reliability and productivity.

This independent pipeline runs on a handful of machines, executing our tests and daily automations. Integration testing occurs in the staging environment, which, following our adoption of Kubernetes, is architecturally almost identical to our production environment.

— Greg Bell and Claudiu Coman, Hootsuite

Yelp engineering has largely had an “engineers write their own tests” policy from the early days. There is also a dedicated team, Release Engineering, focused on testing infrastructure and systems.

Traditionally, the team’s mandate has been to maintain shared testing infrastructure like Jenkins. Lately, they’ve focused on offering testing metrics (test suite latency and reliability) across teams and organizations at Yelp to promote local visibility and accountability for testing outcomes.

For microservice testing, they’re aided by work from our core services infrastructure team. Often they’ll help promote modern library usage, like the use of a pytest plugin that automatically integrates with our test results viewing framework.

— Scott Triglia, Yelp
