What broke the bank – Increment: Testing

In 2018, British bank TSB was stuck in the aftermath of an ugly divorce. Though it had been two years since the financial institution had split from Lloyds Banking Group (the two had originally merged in 1995), TSB was still symbiotically tied to its former partner through a hastily set-up clone of the Lloyds Banking Group IT system. Worse, TSB was paying alimony: £100 million (the equivalent, at the time of this writing, of $127 million) in licensing fees per year.

No one likes paying money to their ex, so on April 22, 2018, at 6 p.m., TSB enacted a months-in-the-making plan to change that: migrating billions of customer records for their 5.4 million customers to the IT systems of Spanish company Banco Sabadell, which had bought TSB for £1.7 billion ($2.2 billion) in March 2015.

Banco Sabadell chairman Josep Oliu had announced the plan two weeks before Christmas, 2017, at a celebratory 1,800-person company meeting in Barcelona’s Palau de Congressos de Catalunya, a cavernous, modern conference hall in the city’s financial district. Crucial to the migration would be a new version of a system developed by Banco Sabadell in the year 2000—Proteo, which had been rechristened Proteo4UK specifically for the TSB migration project.

More than 2,500 years of person power had gone into Proteo4UK, Banco Sabadell chief executive Jaime Guardiola Romojaro boasted to the Barcelona crowd. “The integration of Proteo4UK is an unprecedented project in Europe, a project in which more than 1,000 professionals have participated,” he continued. “It would offer a significant boost to our growth in the United Kingdom.”

TSB chose April 22 for the migration because it was a quiet Sunday evening in mid-spring. The bank’s existing IT systems had been offline for most of the weekend as the Proteo4UK project took place, and as customer records were shifted from one system to another. Flipping the switch to recommence public access to bank accounts late on a Sunday evening would allow the bank a slow, smooth entry back into service.

But while Oliu and Guardiola Romojaro were buoyant at the pre-Christmas company meeting, those at TSB who were actually working on the migration were nervous. The project was meant to take 18 months, but it had been running behind schedule and over budget. After all, shifting an entire company’s records from one system to another is no mean feat.

They were right to be nervous.

Twenty minutes after TSB reopened access to accounts, believing that the migration had gone smoothly, it received the first reports of issues. People’s life savings were suddenly missing from their accounts. Tiny purchases had been incorrectly recorded as costing thousands. Some people logged on and were presented not with their own bank accounts but with those of completely different customers.

At 9 p.m., TSB officials notified the UK’s financial regulator, the Financial Conduct Authority (FCA), that something had gone wrong. But the FCA had already taken notice: TSB had screwed up massively, and consumers were up in arms. (In the 21st century, no unhappy customer is ever very far from Twitter.) The FCA, as well as the Prudential Regulation Authority (PRA), another UK financial regulator, came calling around 11:30 p.m. that same night. When they managed to get TSB officials on a conference call just after midnight—now the morning of Monday, April 23—they had one question: What is going on?

Though it would take some time to understand, we now know that 1.3 billion customer records were corrupted in the migration. And as the bank’s IT systems took weeks to recover, millions of people struggled to access their money. More than a year on from TSB’s weekend from hell, experts think they’ve identified the root cause: a lack of rigorous testing.

Bank IT systems have become more complex as customer needs—and expectations—have increased. Sixty years ago, we would have been happy to visit a local branch of our bank during operating hours to deposit money we had in hand or to withdraw it over the counter with the help of a teller. The amount of money in our account directly correlated with the physical cash and coins we handed over. Our account ledger could be tracked using pen and paper, and any sort of computerized system was beyond customers’ reach. Bank employees put traditional card and paper-fed data into giant machines that would tabulate totals at the end of a day’s or week’s trading.

Then, in 1967, the world’s first automated teller machine (ATM) was installed outside a bank in north London. It changed everything about banking—and required a significant shift in the way that banks interfaced with their consumers. Convenience became the watchword, and this principle positioned customers closer than ever to the systems that kept banks running behind the scenes.

“The IT systems a long time ago were pretty much only used by bank employees, and they could pretty much continue running the bank doing only paper things over the counter,” explains Guy Warren, chief executive of ITRS Group, a supplier of technology to 190 banks worldwide. “It wasn’t really until ATMs and then online banking came in that the general public were accessing the bank’s IT systems directly.”

ATMs were just the beginning. Soon, people were able to avoid queues altogether by transferring funds over the phone. This required specialized cards inserted into hardware that could decipher the dual-tone multifrequency (DTMF) signals, which would translate a customer pressing “1” into a command to withdraw money, and “2” into an order to deposit funds.

Internet and mobile banking have brought the customer ever closer to the main systems that keep banks running. Though separate setups, all these systems have to interface with one another and with the core mainframe, triggering balance transactions, updating cash transfers, and so on.

The typical high street retail bank runs its core banking system on a mainframe computer, says BLMS Consulting’s Brian Lancaster, who spent 13 years working at IBM and several more years overseeing the technical departments responsible for the IT systems of HSBC, and who now consults for banks and building societies (community-run lenders accountable to their customers) across the UK. “That’s probably the most resilient platform you can base that core banking system on,” he says, “and it’s probably the most scalable.” The core customer database sits on that mainframe, alongside other IT infrastructure, including the many servers that provide an application interface to the mainframe and allow internet access.

Few customers likely think about the complexity of the data movement that occurs when they log into their online bank account just to load and refresh their information. Logging on will transmit that data through a set of servers; when you make a transaction, the system will duplicate that data on the backend infrastructure, which then does the hard work—shifting cash around from one account to another to pay bills, make repayments, and continue subscriptions.

Now multiply that process by several billion. Today, 69 percent of adults around the world have a bank account, according to data compiled by the World Bank with the help of the Bill and Melinda Gates Foundation. Each of these individuals has to pay bills; some make mortgage repayments; many more have a Netflix or Youku Tudou subscription. And they’re not all in the same bank.

A single bank’s numerous internal IT systems (mobile banking, ATMs, and more) don’t just have to interface with each other. They also have to interface with banks in Bolivia, Guatemala, or Brazil. A Chinese ATM has to be able to spew out money if prompted by a credit card issued in the United States. Money has always been global. But it’s never been so complicated.

“The number of ways you can touch a bank’s IT systems has increased,” says Warren, the ITRS Group executive. And those systems rarely age out of use. New ones, however, continue to come in.

“If you take all the platforms that touch all the different customer bases, and think of all the hours they need to be available, it’s inevitable that you have a problem,” Warren explains. Success is measured by “how good your systems are at repairing themselves, and how good you are at handling a significant outage.”

TSB’s systems weren’t great at repairing themselves. The bank’s team struggled with handling a significant outage, too. But what really broke TSB’s IT systems was their complexity. According to a report compiled for TSB by IBM in the early days of the crisis, “a combination of new applications, advanced use of microservices, combined with use of active-active data centers, have resulted in compounded risk in production.”

Some banks, like HSBC, are global in scale and therefore have highly complex, interconnected systems that are regularly tested, migrated, and updated. “At somewhere like HSBC, that sort of thing is happening all the time,” says former HSBC IT leader Lancaster. He sees HSBC as a model for how other banks should run their IT systems: by dedicating staff and taking their time. “You dot all the i’s, cross all the t’s, and recognize that [it still] needs a considerable amount of planning and testing,” Lancaster says.

With a smaller bank, especially one without extensive migration experience, getting it right is that much more of a challenge.

“The TSB migration was a complex one,” Lancaster says. “I’m not sure they’d got their heads around that level of complexity. I got a very strong impression they hadn’t worked out exactly how to test it.”

Speaking to a UK parliamentary inquiry about the issue weeks after the outage, Andrew Bailey, chief executive of the FCA, confirmed that suspicion. Bad code likely set off TSB’s initial problems, but the interconnected systems of the global financial network meant that its errors were perpetuated and irreversible. The bank kept seeing unexpected errors elsewhere in its IT architecture. Customers received messages that were gibberish or unrelated to their issues.

“To me, that denotes the absence of robust regression testing, because these banking systems are connecting to a lot of outside systems, such as payment systems and messaging systems,” Bailey told members of Parliament. “These things that crop up, when you put a fix in, that you weren’t expecting, get you back to the question of testing.”

Others agreed. IBM experts who were brought in to analyze what had gone wrong didn’t couch their criticism of the bank one bit. They said that they “would expect world-class design rigor, test discipline, comprehensive operational proving, cut-over trial runs, and operational support set-up.” What they found was something different: “IBM has not seen evidence of the application of a rigorous set of go-live criteria to prove production readiness.”
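The IBM findings quoted here do not spell out what such go-live criteria would look like. As a purely illustrative sketch, and not a description of tooling used by TSB, Sabis, or IBM, one check a cut-over trial run might include is a reconciliation pass that compares every record in the old system with its migrated copy and blocks go-live on any discrepancy. The function names, record fields, and sample data below are all hypothetical.

```python
import hashlib


def record_digest(record: dict) -> str:
    """Stable fingerprint of one record, independent of key order."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: dict, target: dict) -> dict:
    """Compare two {record_id: record} mappings and report discrepancies."""
    missing = sorted(set(source) - set(target))      # lost in migration
    unexpected = sorted(set(target) - set(source))   # appeared from nowhere
    changed = sorted(
        record_id
        for record_id in set(source) & set(target)
        if record_digest(source[record_id]) != record_digest(target[record_id])
    )
    return {"missing": missing, "unexpected": unexpected, "changed": changed}


# Hypothetical sample data: account "A2" picks up the wrong balance in transit.
old_system = {
    "A1": {"owner": "Customer 1", "balance_pence": 120_000},
    "A2": {"owner": "Customer 2", "balance_pence": 4_599},
}
new_system = {
    "A1": {"owner": "Customer 1", "balance_pence": 120_000},
    "A2": {"owner": "Customer 2", "balance_pence": 459_900},
}

report = reconcile(old_system, new_system)
go_live = not any(report.values())  # a single discrepancy blocks the switch
print(report, "go-live:", go_live)
```

In this toy run the report flags account "A2" as changed, so the go-live flag is false. A real migration would need far more than this (referential checks, end-to-end payment tests, load tests), but the principle is the same: readiness is demonstrated against explicit criteria rather than assumed.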

TSB had walked into a minefield, and the bank seemingly had no idea.

“There’s a lot of complexity behind the technology being used, and that complexity manifests itself in various ways,” explains Ryan Rubin, an IT expert who has previously worked for EY, and who is now the managing director of Cyberian Defence, a consultancy helping big firms manage cyber risk. “It could lead to downtime and exacerbated events, like we’ve seen.”

Warren explains that UK banks will often aim for a target of “four 9s” availability—meaning that their services are accessible to the public 99.99 percent of the time. In practice, this means that an IT system required to be available every single hour of the day, as online banking is, could be offline for 52 minutes per year. “Three 9s”—99.9 percent availability—doesn’t sound all that different, but it’s equivalent to more than eight hours of downtime a year. “For a [British] bank, four 9s is fine, three 9s is not,” says Warren, who recalls that the first software project he ever advised on was a six 9s project—a control system for a nuclear power station.
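The arithmetic behind those figures is easy to check. The short sketch below assumes a 365-day year and a service that must be up around the clock; the function name is illustrative only.

```python
# Downtime budget implied by an availability target, assuming a
# 365-day year and a service that must be available 24 hours a day.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime for an availability of `nines` nines (4 means 99.99%)."""
    unavailability = 10 ** (-nines)  # e.g. 4 nines leaves 0.0001 of the year
    return MINUTES_PER_YEAR * unavailability


for nines in (3, 4, 6):
    minutes = downtime_minutes_per_year(nines)
    print(f"{nines} nines: {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Four 9s works out to roughly 52.6 minutes a year and three 9s to about 8.76 hours, matching the figures above. Six 9s leaves a budget of only about 32 seconds a year.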

Every time a company effects a change in its IT infrastructure, it runs the risk of something going wrong. Reducing the number of changes can help avoid issues, while changes that are required need rigorous testing, something IBM highlighted as absent in the TSB outage.

Shujun Li, who teaches cybersecurity at the University of Kent and who consults for large organizations (including one large bank and a number of insurers), says that every upgrade and patch comes down to risk management—particularly when dealing with hundreds of millions of dollars’ worth of customers’ funds. “You need to have a procedure making sure the risks are managed properly,” he says. “You [also need to] know, if it goes wrong, how much it will cost in terms of money and reputation.”

Careful planning could mitigate the risks of such downtime in a way that TSB didn’t seem to factor in. “Failures will continue to happen, but the cost of applying resilience and having redundancy has come down,” Rubin says. Storage costs have fallen as network providers and cloud solutions have risen. “These things are all there, which can help the banks to manage their risk and fail gracefully when disaster strikes.”

Still, securing backup plans in the event of disaster may be too costly for some institutions. Warren believes that some banks have become too tightfisted in how they approach IT resiliency. “You can’t do this on a budget,” he explains. “This is a financial service: Either it’s available, or it isn’t. They should’ve spent more.”

Miserly IT spending ultimately levies a toll. TSB reported a £105.4 million ($134 million) loss for 2018, compared to a £162.7 million ($206 million) profit in 2017. Post-migration costs, including compensating customers, correcting fraudulent transactions (which skyrocketed in the chaos and confusion of the outage), and hiring help totaled £330.2 million ($419 million). The bank’s IT provider, Sabis, was sent a bill for £153 million ($194 million) for its role in the crisis.

Perhaps the easiest way to avoid outages is simply to make fewer changes. Yet, as Lancaster says, “every bank, every building society, every company is pushed by the business to build more and more good stuff for the customers and good stuff for the business.” He observes, “There’s a drive to get more and more new systems and functionality in so you can be more competitive.” At the same time, companies, particularly financial ones, have a duty of care to their customers, keeping their savings safe and maintaining the satisfactory operation of existing services. “The dilemma is how much effort do you put into keeping things running when you have a huge pressure from the business to introduce new stuff,” Lancaster says.

Reported technology outages in the financial services sector in the UK increased 187 percent from 2017 to 2018, according to data published by the FCA. Far and away, the most common root cause of outages is a failure in change management. Banks in particular require constant uptime and near-instantaneous transaction reporting. Customers get worried if their cash is floating about in the ether, and become near riotous if you separate them from their money.

A matter of months after TSB’s great outage, the UK’s financial regulators and the Bank of England issued a discussion paper on operational resilience. “The paper is trying to say to the financial organizations: Have you tipped this balance too far in bringing stuff in and not looking after the systems you have on the floor today?” Lancaster explains.

The paper also suggests a potential change to regulation—making individuals within a company responsible for what goes wrong within that company’s IT systems. “When you personally are liable, and you can be made bankrupt or sent to prison, it changes [the situation] a great deal, including the amount of attention paid to it,” Warren says. “You treat it very seriously, because it’s your personal wealth and your personal liberty at stake.”

Since TSB, Rubin says, “there’s definitely more scrutiny. Senior managers can’t afford to ignore or not invest enough in their technology estates. The landscape has changed now in terms of fines and regulatory expectation.”

But regardless of what lessons have been learned from TSB, significant outages will still occur. They’re inevitable.

“I don’t think it can ever go away,” Warren says. Instead, people have to decide: “What’s an acceptable level of availability, and therefore outages?”
