What caused Facebook services to shut down?
On 4 October 2021, the Facebook website and the company’s other apps and services, such as WhatsApp and Instagram, went down for about six hours. Most social media and online services have an extended period of downtime once in a while, but this was one of the longest outages at a big company in the last few years, and it affected several services at once. Many people depend on those services to talk to others and to earn their living, so the outage caused a lot of frustration, which people voiced on platforms such as Twitter, Telegram and Discord, which aren’t part of Facebook and so remained online.
Many theories circulated about what had happened, but Facebook only released details some time after its services had returned. So, what was the problem?
Border Gateway Protocol
Before we get into that, it’s important to have a bit of background on how the internet works.
The internet is a very large network, so finding the website you want to connect to is not trivial. IP addresses help with that by giving an identifier to every device connected to the internet. For home devices those addresses often change over time, while for websites and web services they usually stay the same. A server can also have a name that’s easier to remember, such as “facebook.com”; these names are called “domains”. DNS servers translate those domains into IP addresses, which lets you connect to the server you’re looking for.
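To make the idea of translating a domain into an IP address concrete, here is a minimal sketch of a lookup table in Python. The domain names and addresses are made up for illustration (the addresses come from a range reserved for documentation), and a real DNS server does far more than a dictionary lookup.

```python
# Toy DNS table. The names and addresses are hypothetical; the addresses
# come from a range reserved for documentation (RFC 5737).
DNS_RECORDS = {
    "social.example": "192.0.2.10",
    "chat.example": "192.0.2.20",
}


def resolve(domain: str) -> str:
    """Return the IP address stored for a domain, ignoring case and a trailing dot."""
    try:
        return DNS_RECORDS[domain.lower().rstrip(".")]
    except KeyError:
        raise LookupError(f"no DNS record for {domain!r}") from None


print(resolve("social.example"))  # 192.0.2.10
```

For a real lookup, Python’s standard library offers `socket.getaddrinfo`, which asks the operating system’s resolver instead of a hard-coded table.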
However, while the IP address tells you which server you need to reach, how do you reach it?
The internet is basically a huge network of networks, run by ISPs and other large operators, which are in turn connected to the web servers that you are trying to reach. To reach a server, your computer sends a request that is passed from router to router through this maze until it gets there. Your computer does not pick the route itself. The routers along the way do, and they learn which routes exist by means of the Border Gateway Protocol (BGP), that is, the protocol that routers on the “border” between two networks use to tell each other which IP addresses they can reach.
Large companies such as Facebook run their own networks, and their border routers speak BGP too. They announce the company’s IP address ranges to the neighboring ISPs, so that the ISPs know where those addresses can be found and can guide packets towards them.
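The way routers use announcements to pick a route can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how a real router is implemented: each network is identified by a number (an “AS number”), every announcement pairs a range of IP addresses (a “prefix”) with the chain of networks it passed through, and the shortest chain wins. Real BGP weighs several other criteria as well. All prefixes and numbers below are hypothetical values from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical announcements heard by one router: (prefix, AS path).
# An AS path lists the networks a packet would cross to reach the prefix.
announcements = [
    ("192.0.2.0/24", [64500, 64510]),         # two networks away
    ("192.0.2.0/24", [64501, 64502, 64510]),  # same prefix, longer path
    ("198.51.100.0/24", [64501, 64511]),
]


def build_table(announcements):
    """Keep, for each prefix, the announcement with the shortest AS path."""
    table = {}
    for prefix, as_path in announcements:
        network = ipaddress.ip_network(prefix)
        if network not in table or len(as_path) < len(table[network]):
            table[network] = as_path
    return table


def lookup(table, address):
    """Return the AS path of the most specific prefix containing the address."""
    ip = ipaddress.ip_address(address)
    matches = [network for network in table if ip in network]
    if not matches:
        return None  # no route: the packet cannot be delivered
    best = max(matches, key=lambda network: network.prefixlen)
    return table[best]


table = build_table(announcements)
print(lookup(table, "192.0.2.10"))   # [64500, 64510]
print(lookup(table, "203.0.113.5"))  # None
```

The second lookup returns `None` because nobody announced a prefix covering that address, which is exactly the situation a router is in when a network stops announcing itself.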
So, what happened?
The official explanation released by Facebook was that the outage resulted from a faulty configuration change on its routers. While that is accurate and enough for the average internet user, it is also very vague.
Network specialists theorize that someone in the company pushed a faulty change to the configuration of Facebook’s BGP routers, and neither the employee nor the routers flagged the error. As a result, the routers stopped announcing some of Facebook’s IP address ranges to the ISPs, most importantly those of its DNS servers. With those routes withdrawn, the rest of the internet had no path to Facebook’s DNS servers, so nobody could translate Facebook’s domains into IP addresses. Consequently, no one could reach the Facebook servers anymore.
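The chain of failures can be illustrated with a small simulation. It is only a sketch under simplifying assumptions: the company announces two hypothetical ranges of addresses (one holding its DNS server, one holding its web server), and the faulty change withdraws the first. All names and addresses are invented and come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical address ranges announced by the company's routers.
DNS_PREFIX = ipaddress.ip_network("198.51.100.0/24")
WEB_PREFIX = ipaddress.ip_network("192.0.2.0/24")
announced = {DNS_PREFIX, WEB_PREFIX}

DNS_SERVER_IP = "198.51.100.53"
DNS_RECORDS = {"social.example": "192.0.2.10"}


def reachable(address):
    """True if some announced range covers the address."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in announced)


def visit(domain):
    """Describe what a user would see when opening the domain."""
    if not reachable(DNS_SERVER_IP):
        return "DNS lookup failed: name server unreachable"
    address = DNS_RECORDS[domain]
    if not reachable(address):
        return f"resolved to {address}, but no route to it"
    return f"connected to {address}"


print(visit("social.example"))  # connected to 192.0.2.10
announced.discard(DNS_PREFIX)   # the faulty change withdraws the DNS range
print(visit("social.example"))  # DNS lookup failed: name server unreachable
```

Note that the web server itself never changes in this model. Withdrawing the announcement for the DNS range is enough to make the whole service look offline.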
It’s likely that the faulty configuration also affected internal BGP, the routing that guides computers to other computers inside Facebook’s private network. That would explain the reports that employees were unable to get into the building, as you may have heard.
Of course, this will serve as a warning to Facebook and to other companies that depend on server reliability: big accidents can still happen. Measures will probably be taken to make mistakes like this less likely, such as stricter checks and safeguards before any potentially breaking change, and things will continue as normal.