*Sincere* apologies to everyone impacted by outages of Facebook powered services right now. We are experiencing networking issues and teams are working as fast as possible to debug and restore as fast as possible
— Mike Schroepfer (@schrep) October 4, 2021
In a lengthy statement, Facebook blamed the disruption on a "faulty configuration change". It appears that the Domain Name System (DNS), which Facebook runs for its own domains, was no longer directing users to the correct place. This has led most analysts to conclude that human error, or more likely a sequence of errors, caused the outage.
Facebook said in a statement: “Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centres caused issues that interrupted this communication.
“This disruption to network traffic had a cascading effect on the way our data centres communicate, bringing our services to a halt.”
Cloudflare, a DNS provider, shed further light on Twitter, saying that five minutes before the outage it saw a series of Border Gateway Protocol (BGP) routing updates from Facebook's network. This suggests that the updates inadvertently led to the outage.
How can this be prevented from happening?
Facebook will no doubt have proper procedures in place to mitigate mistakes. It almost certainly uses version control and has an approval process for changes to the DNS. What it may not have prepared for is all of these safeguards failing due to poor decision-making. To address this, it should look at the incident through the lens of human factors.
If your business has experienced human error in its technical operations, IHF can help. Get in touch to learn more.