Ask an expert: What’s the value of transparency in testing and deployment?

Etsy’s chief architect explains why visibility increases collaboration and builds confidence.
Part of Issue 10, August 2019: Testing

From development to testing to deployment, transparency is the key to efficiency and success at every point in the engineering process. When there are no silos, it’s easier to build on each other’s ideas and to fix mistakes earlier in the development cycle. Building visibility into the development workflow increases collaboration through a continuous feedback loop among team members. When something unexpected happens, engineers from across teams can jump right in: reading each other’s code, viewing the same dashboards and telemetry, and offering ideas and suggestions to make improvements in real time.

A visibility-focused philosophy should serve as a foundation for all efforts. At Etsy, for example, we’re active practitioners of DevOps and are dipping our toes into SRE. Our infrastructure engineers work hard to build and provide the systems, tools, and practices that allow our product engineers to roll out new products and features with confidence. And confidence allows talented engineers to push the envelope and innovate in ways that would otherwise be impossible.

Develop in one place

Developing in a centralized location makes transparency easier to achieve, so take an approach that enables everyone to keep track of the latest changes in one place.

Etsy.com is powered by “Etsyweb,” a monorepo that contains all the code needed to render the site: PHP, JavaScript, Mustache templates, and the associated CSS. We practice trunk-based development rather than maintaining long-lived branches that might be merged infrequently. What this means in practice is that new code is turned on or off by feature flags, each of which is an easily changed configuration stanza.

For example, the first commit of an awesome new feature might look like this:

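class LegacyFeature {
    public function doStuff() {
        $this->doThisFirst();
        $this->awesomeNewFeature();
        $this->doThisLast();
        return;
    }

    public function awesomeNewFeature() {
        // The new code path runs only when the flag is switched on.
        if (Feature::isEnabled('awesome_feature')) {
            // Instrument the path so it shows up on dashboards and in traces.
            StatsD::increment('awesome_feature_enabled');
            Tracer::startSpan(['name' => 'awesome_feature']);
            // add code here
        }

        return;
    }
}

// Configuration stanza (schematic; the surrounding entries are elided).
class EtsyConfiguration {
    ...
    $GLOBALS['config']['product_feature_a'] = 100;
    $GLOBALS['config']['awesome_feature'] = 'off'; // the new feature starts switched off
    ...
}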

This practice encourages a culture of incremental development, and ultimately transparency, by allowing for collaborative feedback throughout the whole engineering process.
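To make the flag mechanism concrete, here is a minimal sketch of how a check like Feature::isEnabled might read such a configuration stanza. It is not Etsy's implementation, which the article does not show. The class name FlagSketch, the $userId parameter, the reading of a numeric value as a rollout percentage, and the crc32 bucketing are all assumptions made for illustration.

```php
<?php
// Hypothetical flag reader; not Etsy's actual Feature class.
final class FlagSketch
{
    /** @var array<string, int|string> flag name => 'on', 'off', or 0..100 */
    private array $config;

    public function __construct(array $config)
    {
        $this->config = $config;
    }

    // Returns true when the named flag is enabled for the given user.
    public function isEnabled(string $name, int $userId = 0): bool
    {
        $value = $this->config[$name] ?? 'off'; // unknown flags default to off
        if ($value === 'on') {
            return true;
        }
        if ($value === 'off') {
            return false;
        }
        // Assumption: a number is a rollout percentage between 0 and 100.
        $percent = max(0, min(100, (int) $value));
        // Stable bucket in 0..99, so a user gets the same answer on every request.
        $bucket = abs(crc32($name . ':' . $userId)) % 100;
        return $bucket < $percent;
    }
}

$flags = new FlagSketch([
    'product_feature_a' => 100,
    'awesome_feature' => 'off',
]);

var_dump($flags->isEnabled('awesome_feature', 42));   // the flag is 'off'
var_dump($flags->isEnabled('product_feature_a', 42)); // rolled out to everyone
```

Under these assumptions, changing the configured value from 'off' to a small percentage and then to 100 ramps the feature up without another code deploy.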

For any team, the goal should be a process in which engineers, once they have written a meaningful amount of code, commit it locally, run the test suite, merge it into master, and deploy it to production. This method has several key advantages. Because code is added in small increments and the master branch is what runs in production, every deploy is relatively small and therefore easy to debug if anything goes wrong, which reduces the risk of shipping new code. Frequent rebasing also minimizes merge conflicts and helps avoid the integration hell that products often go through when multiple teams build them.

At Etsy, we’re able to validate new product ideas more quickly using this process. At the start of a new project, we can gain confidence without having to build and roll out a full-fledged user-facing product. Instead, we can launch a minimum viable product and add more functionality as our users demand it.

Look at the full picture

The key to launching new products and features with confidence is to have the important information front and center, and the rest easily accessible for when it’s needed. Engineers can act quickly when they don’t have to spend time looking for siloed information. Keep your codebase, revision history, and data accessible to all stakeholders to facilitate rapid, informed decision-making.

In the code snippet above, there are calls to two instrumentation systems. The first is to StatsD, a lightweight UDP-based daemon that supports simple counters and timer aggregations. The second is to OpenCensus, a richer but more resource-intensive distributed tracing API. At Etsy, our engineers use StatsD to build dashboards to watch during testing and deployment, and OpenCensus to further understand anomalous behavior.
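As a rough illustration of why StatsD is so lightweight, the sketch below formats a counter and a timer in the StatsD plain-text line format and sends the counter as a single UDP datagram. The function names, host, and port are assumptions for this example (8125 is the usual StatsD default); this is not the StatsD client class used in the snippet.

```php
<?php
// Formats a StatsD counter line, e.g. "awesome_feature_enabled:1|c".
function statsdCounterLine(string $metric, int $delta = 1): string
{
    return sprintf('%s:%d|c', $metric, $delta);
}

// Formats a StatsD timer line in whole milliseconds, e.g. "page.render:13|ms".
function statsdTimerLine(string $metric, float $milliseconds): string
{
    return sprintf('%s:%d|ms', $metric, (int) round($milliseconds));
}

// Sends one line as a UDP datagram. Host and port are assumed values.
// Failures are swallowed on purpose: metrics should never break a request.
function statsdSend(string $line, string $host = '127.0.0.1', int $port = 8125): bool
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 1.0);
    if ($socket === false) {
        return false;
    }
    $written = @fwrite($socket, $line);
    fclose($socket);
    return $written !== false;
}

// Roughly what an increment call could put on the wire.
statsdSend(statsdCounterLine('awesome_feature_enabled'));
```

Because UDP is connectionless, the sender does not wait for the daemon, which keeps the cost of instrumenting a hot code path low at the price of occasionally dropped data points.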

We have a blameless culture. We use incidents (when they inevitably arise) as learning opportunities, and we rely on our visibility tooling to generate permanent, referenceable artifacts we can look back on to better understand what happened. Counterintuitive as it may sound, we see mistakes as valuable to the whole team: when there’s no fear or guilt involved, engineers can share and learn from their errors instead of hiding them.

Test early and test often

Testing provides invaluable visibility into the impact that new code has on our overall product and allows us to catch any mistakes before they snowball into a larger issue. Invest in fast, repeated testing to shorten the iteration cycle for engineers.

Over the years, Etsyweb has accumulated a fairly large set of unit and integration tests. Our development environment ships with a command-line tool called try. Upon invocation, it generates and submits a diff to a Jenkins server, which applies the patch and runs the test suite.

We encourage all engineers to run try early and often, and we have a service-level objective, or SLO, of eight minutes for the runtime. There are two teams that monitor and track this metric: The infrastructure team ensures there is sufficient capacity to handle the load, and the frameworks team ensures there’s enough parallelization within the test runners to use all available resources. Set measurable goals to meet during testing, and have team members monitor the results to make sure nothing slips through the cracks.
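One way to turn a runtime objective like this into a number a team can track is to measure the share of recent runs that finished within it. The sketch below assumes that reading of the eight-minute SLO; the function name and the sample runtimes are made up for illustration and do not describe how Etsy measures it.

```php
<?php
const TRY_SLO_SECONDS = 8 * 60; // the eight-minute objective, in seconds

// Returns the fraction of runs that finished within the objective.
// An empty history counts as fully compliant.
function sloCompliance(array $runtimesSeconds, int $sloSeconds = TRY_SLO_SECONDS): float
{
    if ($runtimesSeconds === []) {
        return 1.0;
    }
    $within = array_filter($runtimesSeconds, fn ($seconds) => $seconds <= $sloSeconds);
    return count($within) / count($runtimesSeconds);
}

// Hypothetical runtimes of five recent test runs, in seconds.
$recentRuns = [412, 455, 470, 503, 398];
printf("%.0f%% of runs met the SLO\n", sloCompliance($recentRuns) * 100);
```

A falling ratio tells the two teams where to look: at capacity if runs are queueing, or at parallelization if the runners themselves are slow.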

Deploy via trains

Deploying is a high-stakes process involving actions across multiple systems, and as such transparency is even more important. Share broadly the key indicators of deployment progress and application health. Invest in or build tooling (such as Honeycomb, Lightstep, or Zipkin) that makes introspection possible. Being able to look right where two or more systems meet and interact in a running application is crucial, because that’s usually where issues happen.

At Etsy, once we have a green try build, we’re ready to get the code deployed. We have a central deployment coordination channel in Slack called #push. It’s a queue. Engineers organize into groups—or trains, as we call them—of up to eight per train. The first person in the train is the “driver,” who takes action on behalf of the group. When a group gets to the head of the queue, it’s their turn. Everyone in the train merges their changes into master, and when all are done, the driver kicks off the deployment.
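The queue-of-trains idea can be modeled in a few lines. The sketch below is a hypothetical data structure, not the tooling behind #push: the class and method names are invented, and it simplifies by always adding newcomers to the last train in the queue.

```php
<?php
// Hypothetical model of a deploy queue made of trains.
final class PushQueueSketch
{
    public const MAX_PER_TRAIN = 8; // "up to eight per train"

    /** @var array<int, array<int, string>> trains in queue order */
    private array $trains = [];

    // Adds an engineer to the last train, or starts a new one if it is full.
    public function join(string $engineer): void
    {
        $last = count($this->trains) - 1;
        if ($last < 0 || count($this->trains[$last]) >= self::MAX_PER_TRAIN) {
            $this->trains[] = [$engineer];
            return;
        }
        $this->trains[$last][] = $engineer;
    }

    // The first person in the train at the head of the queue is the driver.
    public function driver(): ?string
    {
        return $this->trains[0][0] ?? null;
    }

    // Removes and returns the head train once its deploy is done.
    public function arrive(): array
    {
        return array_shift($this->trains) ?? [];
    }
}

$queue = new PushQueueSketch();
foreach (range(1, 9) as $i) {
    $queue->join("engineer$i"); // the ninth engineer starts a second train
}
echo $queue->driver(), "\n";
```

Capping the train size keeps each deploy small enough that, if something breaks, only a handful of changes need to be examined.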

Etsy uses a web frontend called Deployinator to execute the series of scripts and/or commands needed to get code from Git onto the many hundreds of servers in production. There are only two buttons to be pressed: Deploy to Staging, then Deploy to Production. The main advantage of using a browser-based interface is that anyone can open the URL and see what’s going on: the state of the current deploy, the run logs, and graphs of key business and system metrics.

The buttons themselves are a deliberate decision. Clicking the button signifies that a human (the driver) is fully engaged and committed to getting the changes live, safely. They’re watching the current deployment; the infrastructure status, as visible on the health dashboard; and other dashboards, logs, and metrics relevant to the current code changes. They’re expected to raise the alarm if anything looks awry.

At the end of every stage, all participants in the train confirm that all is well. Only then does the driver proceed to the next stop. Once the deploy hits production and the driver receives confirmation, the train has reached its destination and the next one in the queue is up.
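The rule that the driver proceeds only after everyone confirms can be expressed as a small state machine. This is a sketch under assumptions, not Deployinator's code: the class name, the method names, and the two-stage list (taken from the two buttons described earlier) are illustrative.

```php
<?php
// Hypothetical gate: a train advances only when every rider has confirmed.
final class TrainDeploySketch
{
    private const STAGES = ['staging', 'production'];

    private array $riders;
    private array $confirmed = [];
    private int $stageIndex = 0;

    public function __construct(array $riders)
    {
        $this->riders = array_values(array_unique($riders));
    }

    // Current stage name, or null once the train has reached its destination.
    public function stage(): ?string
    {
        return self::STAGES[$this->stageIndex] ?? null;
    }

    // Records that a rider has checked their change at the current stage.
    public function confirm(string $rider): void
    {
        if (in_array($rider, $this->riders, true)) {
            $this->confirmed[$rider] = true;
        }
    }

    // The driver's action: moves to the next stage only if all riders confirmed.
    public function advance(): bool
    {
        if ($this->stage() === null || count($this->confirmed) !== count($this->riders)) {
            return false;
        }
        $this->stageIndex++;
        $this->confirmed = []; // every stage needs fresh confirmations
        return true;
    }
}

$deploy = new TrainDeploySketch(['ana', 'ben']);
$deploy->confirm('ana');
var_dump($deploy->advance()); // false: ben has not confirmed staging yet
```

Resetting the confirmations at each stage mirrors the expectation that every participant looks again after the change reaches the next environment.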

A collaborative approach to deployment enables the engineers who wrote the code to see the project through to the end. Should any issues arise, these engineers are best placed to make quick fixes since they are the ones who created the code in the first place. In order to maintain visibility and enable human decision-making through deployment, keep engineers active and involved in the process.

Get comfortable with visibility

Some engineers balk at the idea of making their code visible to others before they’ve had the chance to perfect their work. But perfection, or as close as we can get, is best achieved through collaboration. By giving individuals visibility into all parts of development, deployment, and testing, we enable a constant feedback loop early on in the process and keep the work of the individual connected to the project as a whole. The product of transparency is cohesion, which is the mark of useful code.

About the author

Keyur Govande is Etsy’s chief architect, leading the technical direction of the company’s engineering teams. Prior to Etsy, he held engineering roles at both PayPal and Intel.

@keyurdg
