US internet service provider CenturyLink suffered a major technical outage on Sunday after a misconfiguration in one of its data centers wreaked havoc across the internet.
Due to the technical nature of the outage, which involved both firewall rules and BGP routing, the error spread outward from CenturyLink’s network and also affected other internet service providers, causing connectivity problems for many other companies.
The list of tech giants who had services go down today because of the CenturyLink outage includes big names like Amazon, Twitter, Microsoft (Xbox Live), EA, Blizzard, Steam, Discord, Reddit, Hulu, Duo Security, Imperva, NameCheap, OpenDNS, and many more.
Cloudflare, which was also severely impacted today, said CenturyLink’s outward-propagating issue led to a 3.5% drop in global internet traffic, which would make this one of the biggest internet outages ever recorded.
Root cause: Misconfigured Flowspec rule
According to a CenturyLink status page, the issue originated from CenturyLink’s data center in Mississauga, a city in Ontario, Canada.
The telco says the root cause of the incident was an incorrect Flowspec announcement.
Flowspec is an extension to BGP (the Border Gateway Protocol) that allows companies to use BGP routes to distribute firewall rules across their network. Flowspec announcements are usually used when dealing with security incidents, such as BGP hijacks or DDoS attacks, as they allow companies to change their entire network to react to and mitigate attacks within seconds.
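To make the idea of a firewall rule distributed over BGP more concrete, here is a minimal Python sketch of what such a rule boils down to: a set of match conditions plus an action. The `Packet` and `FlowspecRule` classes and the `filter_packet` function are hypothetical names invented for this illustration, and the addresses come from the documentation range. This is not CenturyLink's configuration or any router vendor's API.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class Packet:
    dst_ip: str
    protocol: str  # "tcp" or "udp"
    dst_port: int


@dataclass(frozen=True)
class FlowspecRule:
    """Hypothetical, simplified stand-in for a Flowspec rule."""
    dst_prefix: str          # destination prefix to match
    protocol: Optional[str]  # None matches any protocol
    dst_port: Optional[int]  # None matches any port
    action: str              # "discard" or "accept"

    def matches(self, pkt: Packet) -> bool:
        if ip_address(pkt.dst_ip) not in ip_network(self.dst_prefix):
            return False
        if self.protocol is not None and pkt.protocol != self.protocol:
            return False
        if self.dst_port is not None and pkt.dst_port != self.dst_port:
            return False
        return True


def filter_packet(rules: List[FlowspecRule], pkt: Packet) -> str:
    """Return the action of the first matching rule, else "accept"."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.action
    return "accept"


# Example: drop a UDP flood aimed at port 53 on one documentation prefix.
rules = [FlowspecRule("203.0.113.0/24", "udp", 53, "discard")]
attack = Packet("203.0.113.10", "udp", 53)
normal = Packet("203.0.113.10", "tcp", 443)
print(filter_packet(rules, attack))
print(filter_packet(rules, normal))
```

In a real network the rule would be announced once and picked up by every router that receives it, which is what makes Flowspec fast in an emergency and risky when the rule is wrong.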
However, today, CenturyLink said that its Mississauga data center sent out an incorrect Flowspec announcement that effectively prevented the company’s BGP routes from taking root.
Cloudflare, which observed the incident from afar, believes CenturyLink effectively put its entire network into a loop by announcing a brand new set of BGP routes and then accidentally dropping all routes via the misconfigured Flowspec rule.
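The loop Cloudflare describes can be pictured with a small simulation. The sketch below assumes, purely for illustration, that the faulty rule discarded the traffic BGP itself runs on (BGP sessions use TCP port 179). Neither CenturyLink nor this article states exactly what the rule matched, so `BAD_RULE`, `blocks_bgp` and `simulate_loop` are hypothetical.

```python
BGP_PORT = 179  # BGP sessions run over TCP port 179

# Hypothetical bad rule: discards the traffic BGP itself depends on.
BAD_RULE = {"protocol": "tcp", "dst_port": BGP_PORT, "action": "discard"}


def blocks_bgp(rule):
    """True if the rule would discard BGP session traffic."""
    return (
        rule["protocol"] == "tcp"
        and rule["dst_port"] == BGP_PORT
        and rule["action"] == "discard"
    )


def simulate_loop(rule, cycles):
    """Trace a router that learns, over BGP, a rule that may block BGP."""
    events = []
    for _ in range(cycles):
        events.append("session up, routes and rule received")
        if not blocks_bgp(rule):
            # A harmless rule leaves the session alone.
            events.append("stable")
            break
        events.append("rule installed, BGP traffic discarded")
        # Losing the session withdraws everything learned over it,
        # including the rule, so the session can come back up.
        events.append("session down, routes and rule withdrawn")
    return events


for event in simulate_loop(BAD_RULE, cycles=2):
    print(event)
```

Under this assumption the router never settles: every time the session recovers it relearns the rule that kills the session.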
BGP routes are the glue that keeps the internet up. They are a type of message that internet companies relay between each other. BGP routes tell each internet provider which chunk of IP addresses is available on its network.
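As a rough sketch of what those announcements amount to, a routing table can be modeled as a map from IP prefixes to the network that announced them, with the most specific matching prefix winning. The prefixes and AS numbers below are reserved documentation values, and `ROUTES` and `lookup` are hypothetical names chosen for this example.

```python
from ipaddress import ip_address, ip_network

# Hypothetical announcements: prefix -> announcing network.
ROUTES = {
    "198.51.100.0/24": "AS64496",
    "203.0.113.0/24": "AS64497",
    "203.0.113.128/25": "AS64498",
}


def lookup(routes, dst):
    """Return the network announcing the longest prefix covering dst."""
    addr = ip_address(dst)
    matches = [ip_network(p) for p in routes if addr in ip_network(p)]
    if not matches:
        return None  # no route: the destination is unreachable
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[str(best)]


print(lookup(ROUTES, "203.0.113.200"))
print(lookup(ROUTES, "203.0.113.5"))
print(lookup({}, "203.0.113.5"))
```

The last call shows why dropped routes matter: with an empty table there is nowhere to send the traffic, even though the destination itself is still online.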
However, as CenturyLink’s incorrect Flowspec command brought down some of the routers inside its network, some of those routers also began to announce incorrect BGP routes to neighboring “Tier 1” internet service providers.
This, in turn, brought down other networks in a domino-like effect.
Outage took seven hours to fix
CenturyLink fixed the issue by taking the rare step of telling all other Tier 1 internet providers to de-peer and ignore any traffic coming from its network. Companies rarely make this kind of decision, as it results in a full loss of connectivity for all their customers.
All in all, CenturyLink had to reset all equipment and start with clean BGP routing tables, a process that took almost seven hours to complete, from around 12:13 UTC to 18:58 UTC, the company said.
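For readers who want to check the "almost seven hours" figure, the duration follows directly from the two timestamps the company gave. A quick calculation, assuming both times fall on the same day:

```python
from datetime import datetime

# Start and end times reported by CenturyLink, both in UTC.
start = datetime.strptime("12:13", "%H:%M")
end = datetime.strptime("18:58", "%H:%M")

duration = end - start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60
print(f"{hours}h {minutes}m")  # 6h 45m
```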
“This was a significant global Internet outage,” said Matthew Prince, co-founder & CEO of Cloudflare, in his analysis of the outage.