Cloudflare black-holed its own traffic for an hour

BGP slip took 19 data centres offline.

Cloudflare has attributed yesterday's hour-long outage to a BGP error that made 19 of its data centres invisible to the Internet.

The company has published a post-mortem of the outage, which was caused by a BGP policy change that accidentally withdrew route announcements for the affected data centres.

“Unfortunately, these 19 locations handle a significant proportion of our global traffic,” the company said. 

“This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations.

“We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.”

The company’s timeline shows that the outage began at 6.27am UTC (4.27pm AEST) on June 21, and the case was closed at 8.00am UTC.

As the post explained, Cloudflare has undertaken an 18-month project to convert its busiest data centres to a “more flexible and resilient architecture” it has dubbed “Multi-Colo PoP” (MCP).

Locations using that architecture include Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, and Tokyo.

BGP the culprit

MCP locations rely on routing configuration that creates a mesh of connections, and that routing is managed using the venerable Internet standard, the Border Gateway Protocol (BGP).

Among other things, BGP lets operators define policies governing which IP address prefixes are advertised by routers to their peers, and which peers routers will accept advertisements from.

As the post explained: “These policies have individual components, which are evaluated sequentially. The end result is that any given prefixes will either be advertised or not advertised.

"A change in policy can mean a previously advertised prefix is no longer advertised, known as being ‘withdrawn’, and those IP addresses will no longer be reachable on the Internet.”

And that’s where Cloudflare’s MCP rollout went wrong: “While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes.”
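To make that concrete, here is a minimal sketch in Python (not any real router configuration language, with invented prefixes and terms) that models a policy as an ordered list of terms evaluated sequentially, and shows how moving a catch-all reject term ahead of a more specific advertise term silently withdraws a prefix.

```python
# Minimal, hypothetical sketch of sequential BGP policy evaluation:
# the first matching term decides whether a prefix is advertised,
# so re-ordering terms can silently withdraw prefixes.
import ipaddress

def evaluate(policy, prefix):
    """Return True if the prefix would be advertised under the given policy."""
    net = ipaddress.ip_network(prefix)
    for match, action in policy:                      # terms checked in order
        if net.subnet_of(ipaddress.ip_network(match)):
            return action == "advertise"              # first match wins
    return False                                      # default: not advertised

# Illustrative prefixes and terms only.
original = [
    ("192.0.2.0/24", "advertise"),   # specific term: advertise these prefixes
    ("192.0.0.0/16", "reject"),      # catch-all reject for everything else
]
reordered = [
    ("192.0.0.0/16", "reject"),      # catch-all now evaluated first...
    ("192.0.2.0/24", "advertise"),   # ...so this term is never reached
]

prefix = "192.0.2.0/24"
print(evaluate(original, prefix))    # True  -> prefix advertised
print(evaluate(reordered, prefix))   # False -> prefix effectively withdrawn
```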

That accidental change made spine routers unreachable over the Internet, making it initially difficult for Cloudflare’s engineers to access them and reverse the change.

The post highlighted how critical the affected locations are: “Even though these locations are only four percent of our total network, the outage impacted 50 percent of total [HTTP] requests.”

As well as making the affected locations invisible to the Internet, there was one more side-effect of the accidental configuration change: it disabled the company’s internal load balancing system.

“This meant that our smaller compute clusters in an MCP received the same amount of traffic as our largest clusters, causing the smaller ones to overload,” it said.
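As a rough illustration of that failure mode (the cluster names and request counts below are hypothetical, and this is not Cloudflare's internal load balancer), the following sketch contrasts a capacity-weighted split of traffic with the equal split that results when the balancing layer stops working:

```python
# Hypothetical numbers: why losing traffic-aware load balancing overloads
# small clusters. With weights, traffic is spread in proportion to capacity;
# without them, every cluster gets an equal share regardless of size.
capacities = {"big-cluster": 80, "medium-cluster": 15, "small-cluster": 5}
total_requests = 1_000_000

def weighted_share(caps, total):
    """Distribute requests in proportion to each cluster's capacity."""
    cap_sum = sum(caps.values())
    return {name: total * cap / cap_sum for name, cap in caps.items()}

def equal_share(caps, total):
    """Distribute requests evenly, ignoring capacity (the failure mode)."""
    return {name: total / len(caps) for name in caps}

for name, load in equal_share(capacities, total_requests).items():
    # A cluster is overloaded when its equal share exceeds its weighted share.
    limit = weighted_share(capacities, total_requests)[name]
    status = "overloaded" if load > limit else "ok"
    print(f"{name}: {load:,.0f} requests vs capacity for {limit:,.0f} ({status})")
```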

The company said it will work on its processes, architecture, and automation to avoid a repeat of the incident.
