16 October 2021

The Facebook Outage Also Highlights the Internet’s Aging Foundations

Emily Taylor

Facebook had a week from hell last week, even without the testimony to Congress of whistleblower Frances Haugen. Two separate technical outages hit its entire suite of services: Facebook, Messenger, WhatsApp, Oculus and Instagram. The incidents highlight the fragility that a massive consolidation of resources, driven by the emergence of supernodes such as Facebook and its Big Tech rivals, brings to the global information and communications network. They also reveal the disparity between the public debate surrounding social media platforms on one hand and the attention paid to the internet’s foundational protocols on the other. (In the interests of transparency, Facebook is a client of my company, Oxford Information Labs, but the information in this piece is derived only from public sources.)

The second Facebook outage—which occurred on Friday, Oct. 8—was bad; a configuration error brought down some services for a proportion of users. But the preceding outage on Monday, Oct. 4, was awful, a disastrous spiral of events that led to a single configuration error taking out some of the world’s most popular digital services for six hours.

Initially, there was speculation that Monday’s events were the result of a hack, cleverly timed to coincide with Haugen’s damning evidence to Congress. It wasn’t. Instead, the cause was a combination of a routine update gone wrong and the perverse impact of security measures designed to prevent unauthorized updates and optimize user experiences and load times.

A software command intended to assess the available capacity of Facebook’s global network contained an error that accidentally disconnected Facebook’s data centers globally. The audit system designed to weed out errors in commands affecting the network had a bug and didn’t detect the mistake.

In addition to the outage of all Facebook’s customer-facing systems, there were reports that much of Facebook’s internal work was affected, too. A New York Times reporter claimed that some employees couldn’t even access the building. A subsequently deleted post from an individual claiming to be part of the recovery team appeared to confirm this.

The impact of the engineering errors was amplified by their interplay with two protocols that underpin the global internet: the domain name system, or DNS, and the border gateway protocol, or BGP.

The domain name system is a bit like the internet’s address book, providing human-memorable names for resources on the network, such as facebook.com, which can then be translated into the numbers that machines use to identify resources, known as IP addresses. Border gateway protocol performs several functions relating to the routing of messages within the distributed network of networks that is the internet. Those individual networks, or autonomous systems, have a dual role. They are the end points where communications originate or terminate, and they are also nodes on the network that can pass on packets of data between end points. BGP acts in part like a post office system, advertising the presence of the individual networks or autonomous systems so that messages can reach their final destinations. BGP also acts as a kind of map, by publishing routing information that enables routers in other networks to determine the most efficient route for traffic between source and destination.
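For readers who think in code, the division of labor between the two protocols can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how either protocol is implemented: the name facebook.example, the IP address and the network numbers are placeholders drawn from ranges reserved for documentation, and real BGP weighs many more factors than the length of a path.

```python
import ipaddress

# Toy "address book": name -> IP address (placeholder values, not real records).
DNS_RECORDS = {"facebook.example": "192.0.2.10"}

# Toy routing information: advertised address block -> candidate paths,
# each path being the list of autonomous systems a message would cross.
ROUTES = {
    "192.0.2.0/24": [[64500, 64501, 64510], [64502, 64510]],
}

def resolve(name):
    """DNS step: translate a human-memorable name into an IP address."""
    return DNS_RECORDS.get(name)

def best_path(ip, routes):
    """BGP step (simplified): pick the shortest advertised path covering ip."""
    addr = ipaddress.ip_address(ip)
    candidates = []
    for prefix, paths in routes.items():
        if addr in ipaddress.ip_network(prefix):
            candidates.extend(paths)
    return min(candidates, key=len) if candidates else None

ip = resolve("facebook.example")
print(best_path(ip, ROUTES))  # [64502, 64510]
print(best_path(ip, {}))      # None: the address exists, but no route leads to it
```

The last line is the point of the sketch: once a network stops advertising its routes, knowing the right address is no longer enough to reach it.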

In cybersecurity circles, the received wisdom is that when things go wrong, it’s always DNS. And so it turned out, as Santosh Janardhan, Facebook’s vice president in charge of infrastructure, explained in a blog post. A feature designed to optimize user experience and speed up loading times led to Facebook’s DNS servers disabling its BGP advertisements. Because those DNS servers controlled access to Facebook’s entire suite of services as well as many internal systems, Facebook simply disappeared from the internet.
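The failure mode can be illustrated with a short, hedged sketch. It assumes, as a simplification of the public account, that each DNS server location announces its own route and withdraws it whenever it cannot reach the data centers; the class, the server names and the check itself are hypothetical and do not represent Facebook's actual code.

```python
from dataclasses import dataclass

@dataclass
class DnsServer:
    """A toy DNS server location that announces its own route via BGP."""
    name: str
    advertising: bool = True

    def health_check(self, data_centers_reachable: bool) -> None:
        # Assumed rule: announce the route only while the data centers can be reached.
        self.advertising = data_centers_reachable

def visible_servers(servers):
    """Names of the servers the rest of the internet can still find."""
    return [s.name for s in servers if s.advertising]

# Hypothetical server names.
fleet = [DnsServer("dns-a"), DnsServer("dns-b"), DnsServer("dns-c")]

# Localized fault: one location loses its link and steps aside; the others keep serving.
fleet[0].health_check(data_centers_reachable=False)
print(visible_servers(fleet))  # ['dns-b', 'dns-c']

# Global fault: every location fails the same check at once, so all routes are withdrawn.
for server in fleet:
    server.health_check(data_centers_reachable=False)
print(visible_servers(fleet))  # []
```

A rule of this kind is sensible when one location fails, because traffic shifts to healthy servers. When every location fails the same check together, the safeguard itself removes the last visible path to the service.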

In addition to revealing that even the most well-resourced tech companies can be vulnerable to human error, the outages also highlight how rapidly internet markets have concentrated into the hands of a few powerful players, of which Facebook is one.

The direct impact was obvious to Facebook’s billions of users, including a growing number of small businesses that use the company’s platforms as their principal online location. Facebook’s market value initially fell by an estimated $50 billion.

Less obvious, and more concerning at the level of systemic policy, were the indirect impacts and where they were felt. According to Cloudflare, the outage had its greatest effect in developing countries and regions, with Turkey, Grenada, Congo and Lesotho at the top of the list. For users of Facebook’s “Free Basics,” a kind of internet-lite provided through a Facebook portal in some developing countries, the entire internet would have gone dark for the duration of the outage. This supports the view that developing countries are more likely to be consumers than creators of technology platforms. Cloudflare also reports a “massive spike” in server failure responses during the outage, which even slowed the load time of websites that embed Facebook scripts in their pages to give their users access to “Like” buttons or comments from the platform.

Last week’s incidents also reveal the systemic dependence of the global network on protocols that go back to the 1980s, such as the DNS and BGP. Those simple, lightweight, interoperable protocols enabled exponential growth of the network. Yet, hardly anyone understands them. As a result, many policymakers who are currently making plans to regulate Big Tech and social media platforms seem to think that Facebook, Google and a small handful of applications are the internet, leading to poor regulatory choices in some cases.

Another aspect of this knowledge gap relates to how the internet’s architecture is managed and develops. For example, the public engagement and academic scholarship relating to ICANN—which coordinates the internet’s system of unique identifiers, such as domain names—is paltry compared to the engagement relating to social media platforms or regulating Big Tech. A director of a leading research institute told me last week that none of their researchers are currently engaged in work on internet governance.

The lack of engagement and public scrutiny brings both positive and negative effects. The upside is that the relative obscurity of traditional internet governance sometimes creates a collegial environment in which technicians can work together across political divides to solve engineering and policy problems. The downside is that low levels of engagement can facilitate “capture” of policy debates, leading to unbalanced policy outcomes. Also, without the urgency created by broader engagement, security problems relating to basic protocols remain unresolved or underplayed, and transitions to newer technologies take a lot longer than they should.

Facebook’s week from hell will give its systems administrators plenty to think about and will no doubt lead to operational security changes. It also provided a textbook example of the potential dangers of supernodes such as Facebook on the global internet: When they suffer an outage, the network as a whole feels the effect. Above all, no matter how powerful a Big Tech company is, if it uses the internet at all, it is reliant on a set of protocols and standards that hardly anyone understands or engages with. Though the questions of whether and how to regulate social media platforms get the headlines, the governance of the internet’s foundational protocols requires urgent attention if the global, open, interoperable internet is to be salvaged.
