Case study: How Akamai weathered a surge in capacity growth

Seeing a year’s worth of capacity growth in a matter of weeks, the CDN services provider hustled to build and reinforce the infrastructure it needed to serve its users (and European soccer fans).
Issue 16, February 2021: Reliability

As the coronavirus forced people around the world into quarantine in early 2020, Akamai—the 22-year-old company whose content delivery network (CDN) runs on more than 300,000 servers spread across nearly 1,500 networks—saw a year’s worth of capacity growth in a matter of weeks. Internet traffic levels were soaring: in Italy, traffic ran 75 percent higher than the February average as stay-at-home orders were enacted, and as similar orders unfolded in the U.S. in March, peak traffic measured a third higher than February levels.

More than half of Fortune 500 companies rely on Akamai to keep their services running, along with 225 game publishers, 200 national government agencies, and social media providers worldwide. In all, the company says, 85 percent of internet users are within a single network hop of an Akamai CDN server.

Quick thinking to repurpose preexisting capacity, lateral planning around supply chains, and some good old-fashioned luck meant Akamai was able to keep its networks—and the platforms and services they support—online throughout the pandemic to date.


Akamai’s first stroke of luck was a planned—but ultimately postponed—bonanza for the world’s most popular sport. The European soccer championships were scheduled to take place between June 12 and July 12, 2020, in 12 cities across the continent. Though the tournament was pushed back to roughly the same dates in 2021, its presence on the sporting calendar had been expected to test Akamai’s video streaming capacity, and the firm had built in headroom to accommodate it.

“We overbuilt so we’d have enough capacity on top, and we had safety margins of a couple of terabits left and right,” says Christian Kaufmann, vice president of network technology at Akamai.

However, running capacity at near-total utilization is no way to shore up reliability under extreme stress—and having extra capacity in Europe did nothing for American internet users who were also stuck at home and, among other things, spending more time on streaming services. Netflix, for instance, doubled its expected subscriber growth in the first quarter of 2020, adding a record 15.77 million paid subscribers globally. A key challenge Akamai faced was maintaining the reliability buffer on its CDN while absorbing the increased usage.

Akamai builds its CDN servers as close as possible to the end user, in consultation with internet service providers (ISPs). Minimizing latency matters, so the company offers various services to prioritize it: one places Akamai servers inside the ISP’s own network, distributed evenly across its geography; another places them in a central data center close to the ISP. TCP requires a handshake before any data flows, and because a connection keeps only a limited window of data in flight per round trip, even millisecond-level latency affects how much data users can download at the expected speed. For an ordinary website, it doesn’t make a meaningful difference whether an end user in New York receives their data from a data center down the road or in Milwaukee, says Kaufmann, “but if you’re talking about an HD or 4K video, it pretty much has to be in the same country, ideally in a nearby metro [area].”
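To make that concrete, here’s a rough back-of-the-envelope sketch in Python (our illustration, not Akamai’s model): because TCP moves at most one window of data per round trip, its sustained throughput is bounded by window size divided by round-trip time.

    # Back-of-the-envelope illustration: TCP keeps at most one window of
    # unacknowledged data in flight per round trip, so sustained throughput
    # is roughly capped at window_size / round_trip_time.
    def max_tcp_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
        """Upper bound on TCP throughput, in megabits per second."""
        return (window_bytes * 8) / (rtt_ms / 1000) / 1_000_000

    # A common 64 KiB window over a 5 ms metro-area path versus a 100 ms
    # long-haul path:
    print(max_tcp_throughput_mbps(65536, 5))    # ~105 Mbps: comfortable for 4K
    print(max_tcp_throughput_mbps(65536, 100))  # ~5 Mbps: marginal even for HD

The twenty-fold gap between those two figures is the case for serving video from a nearby metro area.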

What enabled Akamai to maintain high reliability is that its servers are spread around the world. Traffic that requires reliable, low-latency connections is routed to nearby servers; general web traffic and large downloads, such as operating system patches and games, go through more remote servers, called spillover or overflow servers.

“As long as you have enough servers and bandwidth, which we had because we built it ahead [of time], the platform in itself is just [pointing] you in the right direction,” says Kaufmann, adding that the principle of building bases worldwide “scales very well.”
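The article doesn’t detail Akamai’s mapping system, but the steering principle Kaufmann describes can be sketched in a few lines of Python (a hypothetical simplification; the names, thresholds, and regions below are invented for illustration):

    # Hypothetical sketch of latency-aware traffic steering. Latency-sensitive
    # requests (e.g., video) go to the nearest region with spare capacity,
    # while bulk transfers (OS patches, game downloads) spill over to whichever
    # region is least loaded, however remote.
    from dataclasses import dataclass

    @dataclass
    class Region:
        name: str
        rtt_ms: float        # round-trip time from the user to this region
        utilization: float   # fraction of capacity in use, 0.0 to 1.0

    def pick_region(regions: list[Region], latency_sensitive: bool) -> Region:
        # Keep only regions with headroom below an assumed safety buffer.
        candidates = [r for r in regions if r.utilization < 0.8]
        if latency_sensitive:
            return min(candidates, key=lambda r: r.rtt_ms)
        return min(candidates, key=lambda r: r.utilization)

    regions = [Region("local-metro", 5, 0.75), Region("remote", 80, 0.30)]
    print(pick_region(regions, latency_sensitive=True).name)   # local-metro
    print(pick_region(regions, latency_sensitive=False).name)  # remote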

Still, during the early days of the pandemic—as the world, and working practices, first entered a state of flux—the increase in usage ate into the buffer Akamai likes to keep around its CDNs.

Pre-pandemic, Kaufmann says, the solution to a depleted buffer was relatively straightforward: build new servers, and build out existing ones to bolster capacity. But that strategy is subject to the vagaries of international trade, and it turns a problem of handling, directing, and transferring data into one of handling, directing, and transferring physical parts. The spare capacity built out for the soccer tournament could only buy time—at some point, Akamai would have to make physical infrastructure changes.

“We need servers, routers, switches,” says Kaufmann. “We have to put them in data centers, and all of them have challenges.”

The majority of hardware suppliers for the world’s data centers are based in China, where a nationwide quarantine in early 2020 brought everything, from production on factory lines to distribution of key parts, to a halt.

Akamai had some additional hardware capacity—spare servers in warehouses it owned, in countries that weren’t under such restrictions—but it was quickly used up. The company started holding weekly internal supply chain calls to identify when and where parts were being held up. Getting new servers and their component parts remained unpredictable as pandemic restrictions shifted.

When the servers were finally built, installation in data centers posed a new challenge. “In some countries, we were regarded as critical infrastructure and were allowed to go to the data center and build stuff,” says Kaufmann. “In others, we had to wait until the lockdown lifted. You end up with big lists of countries and cities, [what you’re allowed to do where], and you try to manage that, because it doesn’t make sense to ship a server where you can’t install it.”

To reduce inefficiency and keep servers online, Akamai built models to identify countries it believed would shortly enter or leave quarantine, trying to time deliveries to coincide with when they’d be able to access data centers. “It was clear that every country would get hit sooner or later, and most of them would go into lockdown,” he says. “You’re trying to manage that just in time.”
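In spirit, that just-in-time logic resembles the following sketch (a hypothetical simplification; the countries, dates, and lockdown predictions are invented, and the article doesn’t describe Akamai’s actual models):

    # Hypothetical just-in-time shipment planning: ship a server only if it's
    # predicted to arrive while the destination's data centers are accessible,
    # i.e., outside a predicted lockdown window.
    from datetime import date, timedelta

    # Assumed input: predicted lockdown windows per country (start, end).
    predicted_lockdowns = {
        "DE": (date(2020, 4, 1), date(2020, 5, 15)),
    }

    def can_install(country: str, arrival: date) -> bool:
        start, end = predicted_lockdowns[country]
        return not (start <= arrival <= end)

    def worth_shipping(country: str, ship_date: date, transit_days: int) -> bool:
        """True if the server can be installed when it arrives."""
        return can_install(country, ship_date + timedelta(days=transit_days))

    # A 10-day transit starting April 20 lands mid-lockdown, so hold the
    # shipment; starting May 20 lands after the predicted lift.
    print(worth_shipping("DE", date(2020, 4, 20), 10))  # False
    print(worth_shipping("DE", date(2020, 5, 20), 10))  # True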


From May through September 2020, COVID cases leveled off in many countries, which provided something of a reprieve from these pain points—and time to build the hardware supply chain back up.

Akamai no longer fears it’ll be blindsided by an explosion in data usage. “You’ll get bigger files and more streaming for sure,” Kaufmann says. But, whether you download Call of Duty or watch three hours of streaming video a night, “you don’t have more eyes or more time.” However, the company does worry that the maintenance of already installed hardware, and the rollout of new servers to data centers worldwide, could once again become as difficult as it was in April 2020, when borders closed and international supply chains stalled.

Nevertheless, the company’s efforts to know more about its supply chain at the height of quarantine, when resources were stretched and pressure was highest, have put it on a solid footing, Kaufmann says. He’s hopeful that level of detail will see Akamai reliably through just about anything.

About the author

Chris Stokel-Walker is a UK-based features journalist for The Economist, Bloomberg, the BBC, and Wired UK. His first book, YouTubers, was published in 2019, and his second, TikTok Boom, was published in July 2021.

@stokel
