On January 31st, a production engineer at GitLab was troubleshooting a spam problem with their PostgreSQL databases. Thinking that he was logged into the secondary database, the engineer set out to delete all of the secondary database’s data, only to discover (a little too late) that he had deleted all of the production data on the primary database instead. What followed was a strange and impressive series of events that culminated in a change to GitLab’s on-call runbooks to ensure that every high-severity outage in the future would be livestreamed to the public on YouTube.
GitLab stores all of their production data for GitLab.com on two PostgreSQL servers: a primary database server that is serving all production traffic and a secondary server that is standing by in case the primary server goes down. On the morning of January 31st, an engineer noticed that the load on the database servers had dramatically increased due to a problem with the way spam accounts were being detected and removed. Due to the heavy load, the automated replication from the primary server to the secondary server started to lag, eventually falling so far behind that it could no longer automatically replicate. The engineer debugging the spam problem noticed that replication was seriously delayed, and decided to delete the data on the secondary server before forcing a manual replication. Unfortunately, the engineer was logged into the primary (not the secondary) server at the time, and accidentally wiped the primary database—deleting all production data. Once the engineer realized that the primary database had been wiped, he contacted the other production engineers who were on-call, and it was all-hands-on-deck at GitLab until the outage was resolved.
This outage stands out as one of the most interesting outages that the industry has seen in the past year, primarily because of the way GitLab decided to share the outage with the public. The industry standard of outage communication is to keep the details of outages internal and very quiet until the issue is resolved, communicating with external clients and users only if it is absolutely necessary. GitLab, however, took the opposite approach, giving the public full outage transparency: the document tracking the incident was immediately made public, debugging the outage was livestreamed on YouTube, status updates were shared via GitLab’s Twitter account, all on-call runbooks were made publicly available, and the postmortem they published contained all of the details of the outage and its follow-up tasks.
Increment spoke with Sid Sijbrandij, GitLab’s CEO, about the outage, its aftermath, and the impressive level of transparency that GitLab offered to its users both during and after the outage. We have edited the interview slightly for brevity and clarity.
Increment: Let’s start with what things looked like before the outage. What did GitLab’s on-call and incident response process look like? Was there a dedicated incident response or operations team that was trained and prepared to handle these kinds of outages?
Sid: We have a team of people on-call—mostly production engineers—who have escalation policies and runbooks to deal with these incidents. The production engineers are the ones getting paged, and they get paged either because some of the monitoring is going off or because someone uses ChatOps to contact them.
The engineers debugging the database issue on the secondary database—were they following a runbook or incident response process that already existed?
There wasn’t a runbook, because [the debugging had not yet risen to the level of an outage]. They weren’t working on a production outage at the time.
So they were working to debug whatever was happening with the secondary database, then, correct?
When did it turn into an outage, and how did the response then change?
When the alarms went off. The production engineer who was working on the database was not the person on-call, and as soon as he realized that he had made a mistake, he went into the chat rooms and said “look people, I’ve made a big mistake, and we have an incident.” It wasn’t even the monitoring that set it off, although the monitoring went off several minutes after that and the person on-call was paged.
We know from the postmortem what happened after that, and how the outage was debugged, mitigated, and resolved. In the postmortem, there’s also a list of follow-up tasks, all of which are pretty technical in nature: making sure there are appropriate database snapshots, improving runbooks, and the like. One thing we don’t see on this list is a set of organizational changes you might be implementing to prevent this from happening in the future. Are you implementing any organizational changes as a result of this outage?
What happened with the transparency around the issue happened organically, and we think that was a very good part of how we followed up, so we have made sure that it’s an integral part of all the runbooks. That includes escalating to marketing when there’s a major incident, it includes sharing a Google Doc where we are making changes, and it includes doing a live broadcast. Now, all of that happened organically [this time], but it was a success.
I think that what was less of a success was that some people thought that everyone using GitLab was affected. The issue had an effect on the users of GitLab.com, but most of GitLab’s customers are using our [self-hosted] Community or Enterprise editions, which luckily were not affected. [To avoid this confusion in the future], we will be pinging the marketing team so that they can better describe what the impact is, so that people actually know what the impact is too. There’s a merge request that explains our new communications strategy, and you’ll see that we’ve split it up into minor and major outages. We have a runbook now for escalation to marketing, and getting the livestream online. People are interested in that, so they can see what we’ve changed. The next time, this doesn’t have to be an organic thing: we’ll be transparent by default.
It’s remarkable that the transparency happened organically. What were some of the benefits that you saw come out of this?
The benefit was that developers were empathetic to what happened. If you can understand why something happened, then you are able to empathize. They saw [why it happened], they saw how we felt about it, they saw what we were putting in place to prevent it. It was obviously unacceptable what had happened, and it was a good, good reminder for everyone [else in the industry] not to make the same mistakes. I think people appreciated that we contributed our own horrible experience, and that it has served as, well, as a warning not only to ourselves but to others. I think everyone [who was working on the outage] felt good about contributing to the body of knowledge.
Did publicly debugging help resolve the outage more quickly than you think it would have been resolved otherwise? And did the people who were following the debugging and watching the livestream help resolve the outage?
I don’t think [it helped us resolve the issue more quickly], but I’m not 100% sure. I am for sure thankful to everyone who tried to help us, but I don’t think there was an increase in speed. There was a Postgres expert who joined the calls, and that really helped. I don’t think there was a big speed increase because of it, but we’re not ruling that out for next time. What happened this time may not be the case next time: hopefully next time the mistakes are more complex, and there’s more knowledge needed to get there. On the other hand, if it’s more complex, you might need more knowledge up front about what the situation looks like. But [transparency] won’t slow it down and I think it will, in some cases, speed it up.
Do you think that this level of outage transparency is something that could help others in the industry?
I think that there is a lot of generic advice [about resolving incidents] out there, and what people really crave are specifics. We want to add to that body of knowledge [by providing transparency]. Specifics resonate with people in a way that generic things don’t do. So yes, I think it will be very helpful, but we’re not telling other people what to do, we’re just trying to do our own thing.
For us, being transparent has brought a lot of benefits, and not as many downsides as would be expected. We would do it even if it was hard, but it hasn’t been hard. Now, maybe that’s because it’s ingrained in the whole [GitLab] organization, but we find that most of the time, people appreciate a look inside, and that can compensate for the fact that when you’re getting a deeper look into something, you’ll see more ugly things—something that is especially true for developers, and so if developers are your audience, transparency gets easier.
Is there anything else you’d like to add?
I want to thank everyone for all of the hugops we received. Lots of people sent us kind messages over Twitter, and there were even some deliveries to our office. I think that’s really great. I think people were very accommodating, and we are very determined not to let this happen again. I think people will forgive us our first mistake as long as we don’t repeat our mistakes and we [continue to be transparent].
During our conversation, Sid told Increment that the engineer who accidentally deleted the production database (and who still works at GitLab) included his full name in the public documentation tracking the issue. Sid recommended that he remove his name, and the engineer eventually agreed with Sid, but not until after he had argued that publishing his name was the most transparent thing to do. We think it speaks volumes about GitLab’s culture of transparency that engineers involved in outages feel safe about identifying themselves to the public: it speaks of an impressive, strong, and encouraging engineering culture at a company that recognizes that outages aren’t the result of individual mistakes but the consequence of failures of process and procedure.