Facebook on Tuesday blamed an engineering “error of our own making” for the massive outage that knocked Instagram, WhatsApp and Messenger offline for users worldwide for more than six hours.
The outage — which may have cost the company up to $100 million in lost revenue — was triggered when Facebook engineers were performing a routine maintenance job, Santosh Janardhan, Facebook’s vice president of infrastructure, wrote in a blog post.
The engineers issued a command “with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” he said.
And a tool meant to catch mistakes like this before they trigger outages failed to intervene because of a bug, he added.
“This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse,” Janardhan wrote.
That initial failure triggered problems with Facebook’s DNS, or Domain Name System, the directory that maps domain names to the IP addresses browsers and apps need to reach websites.
Earlier this year, an outage at a major DNS operator took out huge swaths of the internet briefly.
“The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers,” Janardhan said.
“All of this happened very fast.”
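The failure mode Janardhan describes is easy to demonstrate: when a domain’s name servers cannot be reached or return no answer, resolution simply fails, and clients have no way to locate the site’s servers. A minimal sketch using Python’s standard resolver (the hostnames are illustrative, not Facebook’s):

```python
import socket

def resolve(hostname):
    """Ask the system resolver (DNS) to map a hostname to IP addresses."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        # Each result tuple carries the address in its last element.
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        # When a name's DNS servers are unreachable (or the name does not
        # exist), resolution fails and the client cannot find the servers,
        # even if those servers are themselves up and running.
        return []

print(resolve("localhost"))      # loopback addresses from the hosts file
print(resolve("name.invalid"))   # empty: the name cannot be resolved
```

This is why Facebook’s servers were “still operational” yet unreachable: the machines were fine, but the lookup step that points clients at them had vanished.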
Facebook staffers were slow to respond because the outage also hit Facebook’s own internal security systems, in some cases locking employees out of the physical areas they needed to reach.
It was “not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this,” Janardhan said.
“So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.”
And even once the issue was identified and dealt with, Janardhan said, Facebook could not bring all of its systems back online at once because they might crash again due to a surge in traffic.
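The concern Janardhan raises, that systems brought back all at once could be crushed by a surge of pent-up traffic, is why operators typically restore capacity in waves. A hypothetical sketch of that staging pattern (the names and structure are invented for illustration, not Facebook’s actual tooling):

```python
import time

def staged_restart(services, fraction=0.25, pause=0.0):
    """Bring services back online in batches rather than all at once,
    so returning traffic has time to stabilize between waves.
    Illustrative sketch only."""
    total = len(services)
    batch_size = max(1, int(total * fraction))
    restored = 0
    while restored < total:
        for svc in services[restored:restored + batch_size]:
            svc["online"] = True  # stand-in for a real restart call
        restored = min(total, restored + batch_size)
        time.sleep(pause)  # let load settle before the next wave
    return restored

# Bring a toy fleet of eight services back in 25% increments.
fleet = [{"name": f"dc-{i}", "online": False} for i in range(8)]
staged_restart(fleet)
```

Ramping back up in increments trades a longer recovery for a lower risk of the freshly restarted systems failing again under load.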
The company is reviewing what happened and looking for ways in which it could improve the process, he added.
“We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making,” he said.
“I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this.”