An engineer’s guide to cloud capacity planning

If you’re a small company with big dreams for the future, one of the biggest advantages cloud infrastructure providers have over traditional provisioning systems is the flexibility they offer you to adjust the resources your application uses. It no longer takes opening a ticket with IT to kick off a process of negotiating with a colocation provider for rack space to get some servers installed six weeks from now. The process is now entirely abstracted behind an API call, with servers that will be ready in single-digit seconds.

The dominant provisioning strategy for the old model was simple: overprovision, and overprovision by a lot. The minimum increment you could increase capacity by was typically “one server”, and the human and opportunity costs of changing a deployment were so high that changes needed to be batched. This necessitated deploying more than you’d ever think you need and, if you were wrong, radically expanding the deployment while attempting to frantically optimize within your existing footprint.

The new cloud paradigm lets DevOps teams deploy new resources in a very granular fashion, and the temptation is strong to fight any capacity problem by simply throwing more money at it. That’s often an excellent strategy, but it requires us to get better at doing capacity planning for the cloud.

Table stakes

Getting your capacity planning right is less important than being able to react when you have gotten it wrong. You should adopt architectures and development practices which are amenable to changing a deployed application. Otherwise, you lose the benefit of flexible provisioning—the human costs of changing your application will swamp the benefits of cloud flexibility, and you’ll end up in the pre-cloud world. Worse, you’ll find yourself paying a premium for resources, justified by the flexibility you can’t take advantage of.

Pick an architecture amenable to adding capacity

Most applications will scale using a combination of two approaches: horizontal scaling (“buy more boxes!”) and vertical scaling (“buy bigger boxes!”). The industry has cottoned on to a few architectural approaches which play well with this reality, and you probably shouldn’t reinvent the wheel.

You will probably end up using n-tier architecture, with a relatively large number of application servers talking to a relatively small pool of databases (or other backing data stores). This is overwhelmingly the most common model for web application deployment, because it allows you to take advantage of horizontal scalability on your application tier(s) while vertically scaling the database.

This architecture is extraordinarily well-proven in a wide variety of contexts, and scales from prototypes to some of the world’s largest production deployments without any fundamental changes. If your company grows into a Google or a Facebook, you might have to do something more exotic—but at that point, you’ll have thousands of engineers to throw at the scaling problem.

Decouple your application code from knowledge of the deployment environment

While this has been a best practice for decades, a hardcoded reference to an API server here or there wasn’t a huge problem if one only added API servers once or twice in the lifetime of an application. When you’re adding them on a minute-to-minute basis, though, you need your application (and/or deployment/provisioning process) to handle configuration changes for you.

There exist excellent open source options for managing this these days. For example, Consul makes service discovery very easy: servers simply advertise the availability of particular services to Consul. Consumers can then find services via an API or, more commonly, by doing DNS lookups whose answers are distributed over the pool of servers running that service.
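As a rough illustration of the DNS approach, here is a minimal Python sketch. It assumes the host's resolver has been configured to forward the .consul domain to a Consul agent (Consul's DNS interface listens on port 8600 by default, so this takes some setup), and the service name "api" is hypothetical. The resolver is passed in as a parameter so the lookup logic can be exercised without a running Consul cluster.

```python
import random
import socket


def consul_dns_name(service, domain="consul"):
    """Name under which Consul's DNS interface publishes a service."""
    return f"{service}.service.{domain}"


def resolve_service(service, port, resolver=socket.getaddrinfo):
    """Return the IP addresses currently advertised for a service.

    Assumption: the system resolver forwards *.consul queries to Consul.
    """
    infos = resolver(consul_dns_name(service), port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    return sorted({info[4][0] for info in infos})


def pick_server(service, port, resolver=socket.getaddrinfo):
    """Choose one advertised server at random for this request."""
    return random.choice(resolve_service(service, port, resolver))
```

The application code only ever knows the name "api"; which boxes answer to that name can change from minute to minute without a deploy.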

Particularly for applications using a service-oriented architecture or multiple layers, you will find that a ubiquitous communications substrate makes it easier to resize services at will or add new services as the application’s needs change. There are a variety of options here. NSQ has proven an extraordinarily performant and easy-to-adopt message bus in my experience, and Kafka also has many fans. The ubiquity and standardization were very important for our applications: they allowed us to concentrate our internal tooling and operational processes on a handful of issues (rather than addressing an O(n^2) interaction between producers and consumers), and they simplified debates about how to change the application, since the answer was almost always “expose another event to NSQ.”

The gold standard you’ll want to approach is “No application or service running on your infrastructure not directly involved in controlling the infrastructure needs to be consciously aware of servers joining or leaving the deployment.” It’s a cloud-within-a-cloud; the only knowledge any individual box needs is how to connect to the mesh which routes requests to it and which it should stream its own requests into.

This might sound complicated, but the technology has improved so much recently that it is easily within the reach of the smallest development teams. At my last company, we had a bubblegum-and-duct-tape version of this infrastructure with approximately two weeks of work by a non-specialized engineer who had never used any of the individual pieces before. Other development teams have suffered a lot so that you don’t have to—or at least so that your suffering is concentrated closer to the unique business value provided by your application.

Automate provisioning and deployment

While you could provision cloud resources by clicking around in your provider’s online interface to launch instances and then SSHing into them, this will result in you burning an incredible amount of time managing servers, dealing with inconsistencies across your fleet, and cleaning up after operator error. You should invest early in automated provisioning for your servers (or other resources) and automated deployment for your application.

At my last company, we used Ansible to automate configuration of our servers after they had been brought up. Chef, Puppet, Salt, and plain old shell scripts are all acceptable options as well. Shipping comprehensive Ansible scripts for provisioning each type of box we had and automating the deployment of our services was very non-trivial, but being able to respond within minutes rather than days to opportunities to optimize our architecture proved more than worth the upfront engineering cost.

In addition to easing capacity optimization, automated provisioning and deployment simplified our operational processes significantly. We were able to treat our boxes as cattle, not pets—our primary step to remediate one-off issues was to simply kill-and-replace the box with the problem rather than attempt to come to an understanding of what the problem actually was. A filled hard disk, a noisy neighbor, failing hardware, or a botched deploy became indistinguishable in our runbooks — “just throw it away; it isn’t worth your time diagnosing and then manually correcting the issue.”

Some applications will eventually reach the scale, and some organizations will eventually reach the maturity, where the application itself is responsible for auto-scaling and auto-healing. This is probably overkill for most readers, because the upfront complexity increases substantially. Instead, you’ll likely have developers or operations teams adjust resourcing manually using a common set of mostly automated tools. This provides a good glide path into auto-scaling, since you’ll have the opportunity to burn in your tooling against “Things That Only Happen In Production” over months or years prior to having computers have to make decisions which are robust against edge cases.

Cloud providers have offerings which purport to handle scaling for you. These help to automate some of the mechanics of responding to either natural growth in demand or intertemporal variation in usage of an application (translation: servers might not need to be awake when users are sleeping). That said, you probably need to be sophisticated enough in provisioning and operations to respond manually to swings in demand before you can enable autoscaling without it either causing incidents in production or running up gratuitously large bills.

Fermi estimates for capacity planning

Early in the lifecycles of most applications, accuracy is overrated and expensive. You’ll initially provision to within an order of magnitude of your expected load, and then adjust as required.

At design time, the most clarifying question is “What do we expect to break first?” At a previous company, for example, we shipped an application with almost a dozen services running under the hood. Doing rigorous scaling estimates for all of them would have been fairly difficult, but it was unnecessary: a combination of high intrinsic utilization plus performance of the chosen technology stack plus observed fiddliness during development made it very, very obvious which service was likely to fail first. This meant it was likely to require the most resources on an absolute basis and also the most engineering and operational time responding to fires. We concentrated our efforts on capacity planning for this service and hand-waved for the rest of them.

After you’ve identified the service to concentrate on, you need to figure out what drives its capacity requirements and what resource is the limiting reagent for it operating.

Looking at drivers

Applications don’t need to scale. Code doesn’t care if it drops requests. Businesses occasionally do care, though, so it is important to think fairly rigorously about what the business requirements for a particular service are.

In our case, the application was a game, and the service we were scaling provided the primary AI and world state for the game world. If the service was over capacity, a portion of our players would be totally unable to play. Perhaps surprisingly for engineers who work on mission-critical business applications, occasional spikes in which 90%+ of our users were entirely unable to use our company’s sole application were an entirely acceptable engineering tradeoff versus sizing our capacity against our peak loads. (The difference between our peak load and a typical high water mark for a week was a factor of over 500.)

We instead sized our initial capacity against a target concurrent usage of the game which we thought would represent a healthy business if we were able to sustain it, with the intention of growing our target capacity numbers as the business got healthier. Other businesses might have to support peak loads rather than much lower baseline, steady-state loads.

We expressed our peak load in terms of “active players per hour,” since the design of our system required keeping a persistent process for the duration of someone’s play. Most applications will probably instead use requests per second.

Looking at limiting reagents

Different technology stacks and workloads consume resources in sharply different fashions. For example, Ruby on Rails applications scale horizontally by adding processes, and the processes typically consume proportionally more memory than any other system resource. (Non-trivial Rails apps quickly hit several hundred megabytes of RAM usage in steady state.) Since each Rails process can only service one simultaneous request, one buys extra capacity by buying memory:

Required Memory = (Target Requests Per Second)  
    * (Average Length of Request in Seconds)  
    * (Average Size of Process At Steady State)

So, for an application which wants to service 1,000 requests per second, with an average request length of 350 ms and an average process size of 250 MB, we need ~350 processes available (1,000 × 0.35), which costs us ~90 GB of RAM (350 × 250 MB ≈ 87.5 GB). Rounding up to 96 GB for some headroom, we could provision twelve boxes with 8 GB each, twenty-four with 4 GB each, etc.
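For readers who prefer to see the arithmetic as code, here is a small Python sketch of the same calculation. The function names are mine, and it uses decimal gigabytes (1 GB = 1,000 MB) to match the rounding in the text; treat it as a back-of-the-envelope aid, not a sizing tool.

```python
import math


def required_memory_gb(target_rps, avg_request_ms, process_size_mb):
    """Apply the formula above: in-flight requests x memory per process."""
    # Each process serves one request at a time, so the number of
    # processes needed equals the expected number of in-flight requests.
    processes = math.ceil(target_rps * avg_request_ms / 1000)
    memory_gb = processes * process_size_mb / 1000  # decimal GB
    return processes, memory_gb


def boxes_needed(total_gb, gb_per_box):
    """Round up to a whole number of boxes of the given memory size."""
    return math.ceil(total_gb / gb_per_box)


if __name__ == "__main__":
    processes, memory_gb = required_memory_gb(1000, 350, 250)
    print(processes, memory_gb)                       # 350 87.5
    print(boxes_needed(96, 8), boxes_needed(96, 4))   # 12 24
```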

There are many features of one’s deployment environment other than RAM, including CPU capacity, hard disk access speeds, and networking bandwidth. We ignore all of these because they are not what empirically runs out first for the majority of Rails applications: memory is the first to go. To a first approximation we’ll never hit our CPU limits on any of our boxes in non-pathological use of the system. If we do, we’ll cross that bridge when we come to it. (We will also add CPU usage to the list of things we monitor in production, because no assumption is so expensive as the assumption which turns out to be both wrong and not known to be wrong.)

This modeling approach doesn’t necessarily work for all stacks or workloads, particularly ones which are very heterogeneous in distribution of response times (for example, when they necessarily use a less-than-reliable API whose performance is not under your control). It’s designed to be cheap to execute and accurate enough to let you get back to building systems, rather than to exactly bracket your hardware requirements.

Estimating performance under uncertainty

If there isn’t a similar rule of thumb available for your stack of choice, you’ll probably have to experiment a bit. There are a number of approaches you can take.

Pull from your hindquarters

How many requests can a CPU handle per second? 10 is clearly too low; computers are fast. 1,000 might be too high; some requests do take a lot of work, some internal services are flaky, and some stacks are intrinsically slow. 100 seems to be a happy, defensible compromise. Just assume 100 a second.

Run a microbenchmark

You can create a microbenchmark which simulates a trivial request/response for your application and then benchmark either a single component of your application or the entire end-to-end data flow. This is something that people have done before; TechEmpower has good suggestions for designing benchmarks and some accumulated results on modern stacks and hardware.
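A microbenchmark can be as small as the following Python sketch, which times a stand-in handler using the standard library's timeit module. The handler here (parse some JSON, do a little arithmetic, serialize the result) is entirely hypothetical; you would swap in a trivial request path from your own application. The number it prints describes one core on the machine you ran it on and nothing more.

```python
import json
import timeit


def handle_request(payload: bytes) -> bytes:
    """Hypothetical trivial handler: parse, do a little work, serialize."""
    data = json.loads(payload)
    data["total"] = sum(data["items"])
    return json.dumps(data).encode("utf-8")


def requests_per_second(handler, payload, iterations=10_000, repeats=5):
    """Estimate single-core throughput for one handler and one payload."""
    # Take the best of several runs to reduce noise from other processes.
    timings = timeit.repeat(
        lambda: handler(payload), number=iterations, repeat=repeats
    )
    return iterations / min(timings)


if __name__ == "__main__":
    sample = json.dumps({"items": list(range(100))}).encode("utf-8")
    rps = requests_per_second(handle_request, sample)
    print(f"{rps:,.0f} requests/second on one core")
```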

Do load testing of your actual application

You could write scripts which simulate plausible use of your application and execute them, over the open Internet, against your staging environment, dialing up the concurrency knob until something breaks. This is hard. Almost no script accurately captures the challenges of production workloads. It is also often unnecessary, since you won’t get a much more accurate result than the above approaches, despite the increased engineering cost.
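If you do go down this road, the "concurrency knob" might look something like the Python sketch below. Everything in it is an assumption made for illustration: send_request is a hypothetical callable that performs one scripted interaction against your staging environment and raises on failure, and the thresholds for "broke" (p95 latency over 500 ms, or more than 1% errors) are placeholders you would choose yourself.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_step(send_request, concurrency, requests_per_worker=20):
    """Run one load step; return (error_rate, p95_latency_seconds)."""
    def worker(_):
        results = []
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            try:
                send_request()
                ok = True
            except Exception:
                ok = False
            results.append((ok, time.perf_counter() - start))
        return results

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        batches = pool.map(worker, range(concurrency))
        samples = [sample for batch in batches for sample in batch]

    latencies = sorted(latency for _, latency in samples)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    error_rate = sum(1 for ok, _ in samples if not ok) / len(samples)
    return error_rate, p95


def find_breaking_point(send_request, max_latency=0.5,
                        max_error_rate=0.01, limit=1024):
    """Double concurrency until latency or errors exceed the thresholds."""
    concurrency = 1
    while concurrency <= limit:
        error_rate, p95 = run_step(send_request, concurrency)
        if error_rate > max_error_rate or p95 > max_latency:
            return concurrency
        concurrency *= 2
    return None  # nothing broke within the limit
```

Doubling the concurrency at each step keeps the test short and matches the order-of-magnitude precision this kind of estimate calls for.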

Regardless of which approach you use, after you have estimates for your desired capacity and know how much capacity one unit of resources buys you, capacity planning is an exercise in simple division. You’re not anywhere near done with capacity problems yet, but you have the foundation to get started.
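The "simple division" is short enough to write down. The Python sketch below also shows how much the answer swings across the range of guesses for requests per second per CPU discussed earlier. The headroom parameter is my own addition, not part of the method described here, and the target of 1,000 requests per second is only an example.

```python
import math


def units_needed(target_load, capacity_per_unit, headroom=0.0):
    """Desired capacity divided by per-unit capacity, rounded up.

    headroom is an optional safety margin, e.g. 0.25 for 25% extra.
    """
    if capacity_per_unit <= 0:
        raise ValueError("capacity_per_unit must be positive")
    return math.ceil(target_load * (1 + headroom) / capacity_per_unit)


if __name__ == "__main__":
    # The same target under three guesses for requests/second per CPU.
    for guess in (10, 100, 1000):
        print(guess, units_needed(1000, guess))   # 100, then 10, then 1
```

A tenfold error in the per-unit guess is a tenfold error in the answer, which is why the early goal is only to land within an order of magnitude.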

When do we adjust our footprint?

For cutting costs

It’s tempting to spend lots of time working on infrastructure and infrastructure planning, both because it presents novel, hard problems and because it is intrinsically fun. It probably doesn’t contribute enough business value to justify this early in the life of an application, however.

As a quick rule of thumb, if you’re spending less than $1,000 a month on infrastructure, even thinking about optimizing your footprint is a mistake. You’re almost always better off spending the equivalent amount of engineer brainsweat on improving the application or other parts of the business. Many engineers get deeply irrational about cloud spending because of the curious tangibility of it. The $2.40 spent to keep an m3.medium running yesterday feels painful if it’s wasted. It’s important to remember that that instance not being active is not intrinsically more wasteful than an engineer spending 90 seconds walking between desks.

After your spend gets into the tens of thousands of dollars per month—which many applications will never reach!—you’ll have ample justification to regularly revisit how you’re allocating your resources and whether improvements in either your resourcing or your application are worth the engineering time required to capture them. This can be as simple as putting a recurring cloud-cutting party on the calendar; it’s often work that fits well between active projects and, since it can feel enormously productive for relatively little effort, makes a nice activity for buffer days on the schedule.

For increasing capacity

In general, you want to increase in advance of need, as opposed to trailing it. How far in advance, and how far along the forecast growth curve you decide to add, depends on how solid your processes are for adding extra capacity. When the act of adding capacity is painful, risky, or costly, you generally want to add capacity well in advance and overbuy. As the cost of adding capacity gets lower, it becomes possible to do it more frequently in smaller increments.

As a rule of thumb, if adding capacity is a week-long project for you, you probably want to buy 6 to 12 months down your forecast growth curve. If it is a day-long project, shorten to a month out. If you can do it in minutes, then you can probably purchase a week at a time. It doesn’t make much sense to do changes more frequently than weekly on a manual basis.
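Here is one way to turn that rule of thumb into numbers, sketched in Python. Two things are assumptions of mine and not part of the rule: the hour cutoffs used to classify a project as "week-long" or "day-long", and the compound weekly growth curve, which stands in for whatever forecast you actually have.

```python
def planning_horizon_weeks(hours_to_add_capacity):
    """How far ahead to buy, given the effort of adding capacity."""
    if hours_to_add_capacity >= 40:   # roughly a week-long project
        return 26                     # low end of the 6 to 12 month range
    if hours_to_add_capacity >= 8:    # roughly a day-long project
        return 4                      # about a month out
    return 1                          # minutes: buy a week at a time


def forecast_load(current_load, weekly_growth_rate, weeks_ahead):
    """Assumed compound growth; substitute your own forecast curve."""
    return current_load * (1 + weekly_growth_rate) ** weeks_ahead


if __name__ == "__main__":
    horizon = planning_horizon_weeks(8)
    # Load to provision for if adding capacity takes about a day and
    # usage is assumed to grow 5% per week from 200 requests/second.
    print(horizon, round(forecast_load(200, 0.05, horizon)))
```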

If you’re at a sufficient level of DevOps sophistication for your application to take advantage of its own usage cycle and dynamically add and remove capacity, congratulations! Run it continuously. Getting here is a very, very difficult project, and the overwhelming majority of applications probably can’t justify it as among the best uses of limited engineering time, even given fairly material infrastructure spends.

Be willing to be wrong

Capacity planning for the cloud isn’t about getting an exactly right answer, or even an approximately right answer. You are optimizing for the planning process being lightweight enough to not block shipping business value and accurate enough to keep production from crashing without breaking the bank. Even getting approximately there lets you spend your time and attention on issues which more saliently determine the success of your company, like product/market fit and scalably attracting the right customers.
