In October 2008, Neil Hunt, chief product officer at Netflix, gathered a dozen or so of his engineering staffers in The Towering Inferno, the secluded top-floor meeting room at Netflix’s Los Gatos, CA headquarters. The room, which Netflix CEO Reed Hastings occasionally commandeers as his personal office, is away from the main office hustle and bustle of the start-up company, up a flight of stairs and across an outdoor wooden walkway on the building’s rooftop: the ideal place for big-picture thinking.
Big thoughts were needed, because Netflix had a problem: its backend client architecture was, not to put too fine a point on it, crumbling more than the Colosseum and leaning more than the Tower of Pisa.
“We kept having issues with connections and threads,” Hunt recalled at an industry conference in Las Vegas, NV, six years later. “At one point we upgraded the machine to a fantastic $5 million box and it crashed immediately because the extra capacity on the thread pools meant we ran out of connection pools more quickly.”
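The failure Hunt describes can be sketched in a few lines. This is a minimal illustration, not Netflix's code: the pool sizes are invented, and a semaphore stands in for a database connection pool. It shows why adding worker threads without adding connections makes things worse rather than better.

```python
import threading


def run_burst(worker_threads, db_connections):
    """Return how many workers fail to get a connection in one burst."""
    pool = threading.Semaphore(db_connections)  # hypothetical connection pool
    everyone_tried = threading.Barrier(worker_threads)
    failures = []
    lock = threading.Lock()

    def worker():
        # Non-blocking attempt: a worker that finds the pool empty fails.
        got_connection = pool.acquire(blocking=False)
        if not got_connection:
            with lock:
                failures.append(1)
        # Hold the connection until every worker has made its attempt,
        # modelling requests that all overlap in time.
        everyone_tried.wait()
        if got_connection:
            pool.release()

    threads = [threading.Thread(target=worker) for _ in range(worker_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(failures)


# Illustrative numbers only: same connection pool, bigger thread pool.
before = run_burst(worker_threads=20, db_connections=25)
after = run_burst(worker_threads=100, db_connections=25)
print(before, after)
```

With the smaller thread pool every worker finds a connection. With the larger one, every worker beyond the pool's capacity is turned away at once.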
It was an unenviable position to be in for the firm, which had introduced online streaming of its vast video library the year before. Netflix had just partnered with Microsoft to get its app on the Xbox 360, and had agreed to terms with the manufacturers of Blu-ray players and TV set-top boxes to service their customers. Millions of potential users of a new, game-changing technology were about to encounter what we now know as the multibillion-dollar industry of online video streaming, which would transform Netflix from a failing company that mailed DVDs to movie buffs into a television and movie studio that rivals some of Hollywood’s biggest names.
But, back in 2008, with a backend that couldn’t cope, the public wasn’t about to encounter anything—unless Netflix made some changes.
There were two points of failure in the physical technology, Hunt explained to the conference audience in Las Vegas: The disk array that ran Netflix’s database—a single Oracle database on an array of Blade webservers—and the single box that talked to it.
“We knew we were approaching a point where we needed to make this redundant,” said Hunt. But Netflix hadn’t yet forked out the cash for a second data center that would alleviate the problem. “We were vulnerable to those single points of failure.”
That much became abundantly clear in 2008 when the company pushed a piece of firmware to the disk array. It corrupted Netflix’s database, and the company had to spend three days scrambling to recover. (One contemporary news story on the outage—and the customer outrage it sparked—noted that some customers even went back to Blockbuster, which Netflix had made seem decrepit, for their DVDs.) “That wasn’t a total catastrophe because most customers weren’t reliant on the system being up to get value from the service,” explained Hunt—but as Netflix’s DVD mailing arm wound down and its new streaming service caught on, it would become a problem.
“We thought: ‘Let’s rethink this completely, go back to first principles, and think about doing it in the cloud’,” said Hunt.
Over the course of several meetings in The Towering Inferno, Hunt and his team thrashed out a plan that would ensure that database corruption—and the many other issues with connections and threads that seemed to plague the company back in 2008—would never happen again. They’d move to the cloud.
Whether companies are looking to run their applications serving millions of users or to underpin the databases and file servers of multinational businesses, the cloud provides a low-cost, flexible way to ensure reliable IT resources. Firms don’t need to worry about the physical upkeep of their own private data centers storing information; they can build out capacity as and when it’s needed, lowering costs and increasing their adaptability, both important features for a young startup with unpredictable (and potentially limitless) growth. It has been a recent boon, born of technological innovation, that helps power hundreds of thousands of companies, big and small, across the globe.
For Netflix, the move to the cloud proved a prescient decision: between December 2007 and December 2015, the number of hours of content streamed on Netflix increased one thousand times, and the company had eight times as many people signed up to the service at the end of its cloud migration process as it did at the start. Cloud infrastructure was able to stretch to meet this expanding demand while traditional server racks in a data center were not able to (the number of requests per month called through Netflix’s API outstripped the capacity of its traditional data center near the end of 2010). It also proved to be a major cost-saving move.
But at the same time, the cloud was still an unproven, young technology. Amazon, the current leader in cloud computing, had only been offering their Amazon Web Services (AWS) infrastructure products since 2006. Caution was required. Netflix started small, moving over a single page onto AWS to make sure the new system worked. “It’s nicely symbolic,” said Hunt. “We recognised that along the way we probably need to hire some new skills, bring in some new talent, and rethink our organisation.” The company chose AWS over alternative public cloud suppliers because of its breadth of features and its scale, as well as the broader variants of APIs that AWS offered.
Today, the cloud is many companies’ first choice when it comes to storing data and serving their customers. AWS is a $12 billion company, four times bigger than it was in 2013. It has, and has long had, a 40% market share in the public cloud sector, much more than the combined market share of Microsoft, Google and IBM’s cloud offerings, according to data collated by Synergy Research Group. Those that aren’t utilising the cloud often feel they want to, and are frustrated when they can’t: four in 10 businesses have critical company data trapped in legacy systems that can’t be accessed or linked to cloud services, according to a survey by market research company Vanson Bourne for commercial software company Snaplogic, while three in four say that their organization misses out on opportunities because of disconnected data. Vendors’ revenue from the sales of infrastructure products for cloud IT (including servers, storage and Ethernet switches) topped $8 billion in the first quarter of 2017, according to analysts IDC.
But none of that was the case when Netflix started its great migration, nor was it true when Ruslan Meshenberg started at Netflix in January 2011, two years into Netflix’s big move. As one of the first companies to move its services into the cloud, Netflix was literally writing the rulebook for many of the tasks it was undertaking. Meshenberg was thrown in at the deep end.
“That was the very first set of objectives I was given,” he explains. “A complete data-center-to-cloud migration for a core set of platform services. Day one.”
It involved a lot of outside the box thinking—and plenty of trailblazing. “When Netflix made the decision to go all-in on the cloud, most people were barely aware the cloud existed,” he explains. “We had to find solutions to a lot of problems, at a time when there were not a lot of standard, off-the-shelf solutions.”
And the problems, when tackling such an enormous task as the migration of a company the size of Netflix, were numerous—particularly for a team used to the mindset that their system operated in a physical data center.
“When you’re operating in a data center,” says Meshenberg, “you know all of your servers. Your applications are running only on a particular set of hardware units.” The goal for the company in a physical data center is a simple one: keep the hardware running at all times, at all costs. That’s not the case with the cloud. Your software runs on ephemeral instances that aren’t guaranteed to be up for any particular duration, or at any particular time. “You can either lament that ephemerality and try to counteract it, or you can try and embrace it and say: ‘I’m going to build a reliable, available system on top of something that is not.’”
Which is where Netflix’s famed Simian Army comes in. You have to build a system that can fail—in part—while keeping up as a whole. But in order to figure out if your systems have that ability baked into their design, you need to test it.
Netflix built a tool that would self-sabotage its system, and christened it Chaos Monkey. It would be unleashed on the cloud system, wreaking havoc, bringing down aspects of the system as it rampaged around. The notion might seem self-defeating, but it had a purpose. “We decided to simulate the conditions of a crash to make sure that our engineers can architect, write and test software that’s resilient in light of these failures,” explains Meshenberg.
In its early days, Chaos Monkey’s tantrums in the cloud were a dispiriting experience. “It was painful,” Meshenberg admits. “We didn’t have the best practices, and so many of our systems failed in production. But now, since our engineers have this built-in expectation that our systems will have to be tested by Chaos Monkey, in production they’re now writing their software using the best practices that can withstand such destructive testing.”
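The idea behind Chaos Monkey can be shown with a toy model. What follows is a sketch under stated assumptions, not Netflix's tool: the `Instance` class, the three-node fleet and the failover loop are all hypothetical. The "monkey" kills one randomly chosen server, and the calling code is written to survive that loss.

```python
import random


class Instance:
    """A hypothetical stand-in for one ephemeral cloud server."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def chaos_monkey(instances, rng):
    """Terminate one randomly chosen live instance, if any remain."""
    live = [i for i in instances if i.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def resilient_call(instances, request):
    """Try each replica in turn, so one dead server does not fail the request."""
    for instance in instances:
        try:
            return instance.handle(request)
        except ConnectionError:
            continue  # fall through to the next replica
    raise RuntimeError("all instances are down")


rng = random.Random(42)  # seeded so the run is repeatable
fleet = [Instance(f"node-{n}") for n in range(3)]
victim = chaos_monkey(fleet, rng)
response = resilient_call(fleet, "GET /titles")
print(f"killed {victim.name}; {response}")
```

Whichever node is killed, the request is still answered by a survivor. Software that assumed a particular server would always be there would fail the same test.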
Even without Chaos Monkey, there were still early setbacks, including a significant outage across North America on Christmas Eve 2012 thanks to an AWS update to elastic load balancers that tipped Netflix offline—a chastening event. But the company adapted, and came through it. By 2015 all of Netflix’s systems—bar its customer and employee data management databases, and billing and payment components—had been migrated to AWS. It would take a little longer before Meshenberg’s team could celebrate a job complete, but the relatively bump-free path (and the easy scaling up of systems as Netflix’s customer base skyrocketed) vindicated the move.
“The crux of our decision to go into the cloud,” says Meshenberg, was a simple one: “It wasn’t core to our business to build and operate data centers. It’s not something our users get value from. Our users get value from enjoying their entertainment. We decided to focus on that and push the underlying infrastructure to a cloud provider like AWS.”
That said, making the leap was a brave move—not least given that, particularly when Netflix began its migration in 2009 and even when Meshenberg joined the company in 2011, cloud storage was still a relatively unknown technology in the Valley, and an unknown term to the general public. (The Institute of Electrical and Electronics Engineers (IEEE) held just its fourth ever international conference on cloud computing in Washington DC in 2011; technology analysts Gartner were still able, back in 2011, to publish a $2,000 “Hype Cycle” report explaining a technology that was on the rise.) Though those in the know understood the benefits of migrating to the cloud, and had a hunch that the general consensus would follow them, early adopters were still just that—pioneers pushing out the boundaries for the technology.
Going all-in on the cloud required betting on the future—and hoping that others would follow. But for Netflix, dipping their toe into the water of cloud computing wasn’t an option. They had to dive in headfirst.
“We had little doubt that cloud was the future,” explains Meshenberg. “If it was, it didn’t make sense to hedge our bets and straddle both worlds, because that would mean we would lose the focus of getting something done completely to the end.”
There was another factor in the decision for Netflix, too: scalability. “Our business was growing a lot faster than we would be able to build the capacity ourselves,” Meshenberg recalls. “Every time you grow your business your traffic grows by an order of magnitude, you have to rewrite the rules. The thing that worked for you at a smaller scale may no longer work at the bigger scale. We made a bet that the cloud would be a sufficient means in terms of capacity and capability to support our business, and the rest was figuring out the technical details of how.”
For Raj Patel, considering anything but the cloud was never really an option. Head of Cloud Engineering at Pinterest from 2014 to 2016, Patel joined a company that still had to engineer another move: from Amazon Web Services’ legacy cloud to a next-generation cloud system. “It wasn’t any different, frankly, than moving from a data center to the public cloud,” explains Patel. “We did a migration inside of Amazon.”
The move was one that some at the startup were wary of, even despite its benefits. “A cloud migration, in many cases, doesn’t necessarily get them anything,” says Patel. “The appeal has to be why they should do this before the five other things they were thinking about doing for their own group.”
At a small, nimble startup like Pinterest, time and resources are scarce, and an engineering team’s to-do list is as long as the sum of their collective arms. Getting people on-side with the cloud migration required deftness, discussion—and categorically not a top-down edict. It also required going person-to-person, winning small victories in support of the larger battle.
“You have to intuitively appeal or influence the motivations of an individual engineer to achieve your goal,” says Patel. “What I found was that at the earlier stages of the program I explicitly looked for folks that are early adopters or have a vested interest in doing that program or project, and you really focus on making them really successful. Then if the others see it they’ll get on board.”
Certain groups at Pinterest had pent-up frustrations with the older generation of Amazon’s cloud service, particularly when it came to the elasticity of potential future expansion. Data engineering-intensive applications ran up against walls with the old cloud server. Patel saw an in.
“We focused on those who would benefit the most,” he says, selling them on the idea of migrating over to a new cloud server, better equipped to deal with the developments they wanted to introduce. Patel’s team provided those early adopters with the tools to help them smoothly migrate over to the new cloud. That included embedding a consultant or solution engineer (rebranded “site reliability engineers” so as not to ruffle any feathers within the groups they joined) with each application team, who was able to provide the relevant tools and know-how to help ease the transition over to AWS. What the site reliability engineers from Patel’s team didn’t do, though, was impose any ideas or tools on the teams they joined.
“Any time you do a cloud migration—especially with engineers—there’s always this notion of: ‘Here’s my way of doing it, here’s your way of doing it: What’s the right way of doing it?’,” explains Patel. “If you had an outside group tell you this is the only way you’re going to do it, you’re going to run into a lot of friction.”
Rather, the teams worked collaboratively, engendering a sense of common purpose. Pinterest was, in truth, always going to make the move, and the company could have become forceful with its ideas, but Patel wanted a more consensual approach. “Their success is embedded with that application team,” says Patel. “Even though they might be talking about a central tool or approach, they’re perceived from the perspective of that application team.”
Like a pyramid scheme, the early adopters found success, and became proselytisers for the move. “When they talk to others at lunch, they say the migration is going really well; the guys doing it are really helpful, and it’s going just fantastic,” says Patel. “The next time you talk to the sceptics, they say: ‘Let’s go and do it.’”
At the same time, those systems that had successfully made the cloud-to-cloud migration were crowed about internally. Data democracy was crucial, says Patel, in getting across the message that the migration was something to be welcomed, not shunned. “We had important metrics about the progress we were making and would send it out to the whole engineering team to let them see it,” he explains. “People like data—engineers especially. They resonate with that progress.”
Six months later, Pinterest had transferred its backend to the more modern cloud system. The team held a party to celebrate the successful move, but truthfully, it was just another success for a company that has plenty of them.
“Think about it,” says Patel. “This was a company that was doubling or tripling in size every year. When I joined the company it was making $0 in revenue and the first year it was $100 million or something, then the next year something like three times that amount. That was the norm across the entire company. In some ways, it was just business as usual.”
When Patel moved to Symantec in April 2016 to become vice president for cloud platform engineering, things were far from business as usual.
“The magnitude of challenges are, I’d say, 5× with Symantec,” he explains. “That’s one of the things I’ve come to realize: While it’s interesting to talk about companies like Pinterest, Facebook or Instagram, their problem is already solved. They have some of the brightest engineers in the world, their applications are already designed for these cloud-type elastic architectures. In some ways, the challenge is not that interesting. But when you’re dealing with a 30-plus year-old company like Symantec, the challenge is a lot more interesting.”
For decades, Symantec had provided stability and assurance to customers, which matters given its role as a security service. Unlike Pinterest, which was born in the cloud seven years ago, Symantec was founded in 1982, when computers were massive, hulking bits of hardware, hardwired to the wall. By the time the World Wide Web appeared, Symantec had already been in business for as long as Pinterest has been in business, period. A publicly listed company, accountable to shareholders and with $3.6 billion of turnover, comes with more levels of hierarchy than a nimble, community-focused startup born in the Valley.
“There are more business units with general managers, instead of application teams,” explains Patel. “All those barriers are a lot more rigid in a larger enterprise than they are in the more nimble, engineering organisation approach you find in a startup.” There are also people who have been working in the company longer than some of Pinterest’s brightest young engineers—individuals who have decades of experience, and rightly should be listened to when they pass comment on the merits of such a move into the cloud. “Frankly, there were a lot of sceptics, and real architectural challenges in applications that simply have not been designed for the cloud,” says Patel. His work would end up closing down 27 separate data centers around the world and moving everything into the public cloud. The scale seemed almost insurmountable.
Even the business case for convincing staff at Symantec was more difficult; it simply wasn’t as easy an argument to make, because if it ain’t broke, why fix it?
“Your influencing job is probably 5× harder,” says Patel. “Because of the cultural transformation, you have to be a lot more convincing. You’re telling people to work differently which is very difficult, and sometimes the organization has the appetite to do those things, and sometimes they don’t.”
Much as Ruslan Meshenberg felt the need to win over his staff members, and just as Patel had to leverage the enthusiasm of early adopters at Pinterest to convince those who were less keen on taking the leap into the cloud, at Symantec Patel had to undertake a similar “hearts and minds” campaign.
Guided by Patel’s boss, the executive vice president of the sector, his team decided to show, not tell, fellow Symantec staffers about the benefits of cloud migration. “We took all the major classes of application and did a proof of concept for each one of them,” he explains. Patel’s team broke down the challenge, piece by piece, drawing up a technical feasibility study for each application, working with each group’s architect, building a proof of concept that could convince them such a move would work— “as opposed to saying: ‘We’re just going to run off this cliff and it’s going to work.’”
The attitude was a simple one: “Let’s remove the risk, and show that.”
It worked. Conviction built around the move; the only thing left to discuss was how exactly to handle the migration.
Big legacy companies planning a move to the cloud are faced with one of two options: They can go down the lift-and-shift path, or the fix-and-shift route.
The lift-and-shift route is the (comparatively) easy option. You take your pre-existing application as it presently works in a private data center, and make the minimum possible changes before moving it into the cloud. “I understand there’s going to be benefits to moving to the cloud, and I’m probably not going to realise most of them, but we’ll fix it later,” says Patel of the lift-and-shift approach.
Fix-and-shift is harder, but potentially more beneficial. You’re not just going to do the bare minimum work to ensure your application—which worked fine in an offline data center—will work in the cloud. You’re buying into the concept of moving to the cloud, fixing your culture along the way, and making it more adaptable to the new norm.
“A lot of the time what you’ll find is that traditional IT organisations tend to do lift-and-shift,” says Patel. “They’re taking the same thing they had in their private data centres and, whether it’s a corporate mandate or whatever, they say: ‘Let’s just go and move it to the cloud.’ They’re looking for roughly the same technical or organisational approaches to operating in the cloud before the cloud,” he adds. “And in my view, that’s why a lot of those efforts fail.”
It was the same choice that Neil Hunt and his team had considered back in The Towering Inferno conference room. “We could take the existing app, forklift it, and shove it into AWS, then start to chip away at it,” he explained. “That was unappealing. It would be easy to do but we’d bring along a lot of bad architecture and a lot of bad habits.”
Netflix’s second choice was equally unappealing at first glance, simply because of the scale of the task. “We would run our existing infrastructure, and side by side run our AWS infrastructure, and migrate one piece at a time, from one system to another.” As the cloud migration occurred, Netflix was totally transformed. Its application also changed from a hulking, single monolithic application to a clutch of small microservices, each of which can be developed independently of the others. It recast the way the company thought about everything, completely changing the shape and makeup of the firm.
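The side-by-side approach Hunt describes amounts to a routing decision made per service. The sketch below is illustrative only: the `MigrationRouter` class and the service names are invented for the example and do not describe Netflix's actual tooling. It shows both infrastructures running at once while services are cut over one at a time.

```python
class MigrationRouter:
    """Routes each service to the data center until it is moved to the cloud."""

    def __init__(self, services):
        # Every service starts out on the legacy data center.
        self.location = {name: "datacenter" for name in services}

    def migrate(self, service):
        """Cut one service over to the cloud; the rest are untouched."""
        if service not in self.location:
            raise KeyError(f"unknown service: {service}")
        self.location[service] = "cloud"

    def route(self, service, request):
        """Return a label showing which infrastructure handled the request."""
        return f"{self.location[service]} handled {service}: {request}"

    def progress(self):
        """Fraction of services already running in the cloud."""
        moved = sum(1 for where in self.location.values() if where == "cloud")
        return moved / len(self.location)


# Hypothetical service names, chosen only for illustration.
router = MigrationRouter(["search", "ratings", "billing", "playback"])
router.migrate("search")
print(router.route("search", "q=comedy"))    # already served from the cloud
print(router.route("billing", "invoice 7"))  # still served from the data center
print(f"{router.progress():.0%} migrated")
```

Because each cut-over touches a single entry, a service that misbehaves in the cloud can be pointed back at the data center without disturbing the others.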
Years after Netflix’s brave decision to undergo the wholesale application and infrastructure refactor, Symantec came to the same decision: They’re fixing, then they’re shifting. Patel still has a way to go before he can breathe easily: The process has taken—and will take—time, but he’s hopeful about reaching the finish line that lingers temptingly on the horizon.
“I’ll personally feel a lot more excitement when we’re done here at Symantec, just because we’ll have done so much more organisationally,” he explains.
Patel already knows the jubilation that’s felt when you move an entire company into the cloud, and can’t wait to feel that again. For Ruslan Meshenberg, who had helped guide Netflix into the cloud without any major hitches, there was only one way to celebrate the achievement. It’s what Silicon Valley does best: Hold an amazing party.
“We had some fun, and we shared some battle stories,” says Meshenberg. The team shared a sense of achievement—personally and as a group. “Cloud migration involves every single person in a company, whether they’re engineering or not,” he adds.
Meshenberg, who had only known cloud migration in his time with the company, could move on from the project he was handed on the first day of his job, to task number two. It must’ve seemed easy-going in comparison, you’d think. “Relatively speaking,” he agrees — “but probably not less challenging. The only constant is change itself. Nothing stands still. We have to constantly re-evaluate our assumptions and ensure that our ecosystem evolves as well.”
But Meshenberg still holds with him that sense of pride that his team and colleagues pulled off a major cloud migration without much of a hitch—and that they confounded the critics along the way, remaining ahead of the technical curve.
“When we went into the cloud we faced a lot of external scepticism, people saying this will never work, or that it may work but not for us,” he says. “It might not be secure enough, scalable enough—you name it.”
There’s a brief pause, a moment as Meshenberg collects his thoughts. Eventually, he comes out with 10 short words: “It was good to be able to get it done.”