Over at The Register, Shaun Nichols has a cheeky take on what happened to Amazon S3 last week:
Amazon has provided the postmortem for Tuesday's AWS S3 meltdown, shedding light on what caused one of its largest cloud facilities to bring a chunk of the web down.
In a note today to customers, the tech giant said the storage system was knocked offline by a staffer trying to address a problem with its billing system. Essentially, someone mistyped a command within a production environment while debugging a performance gremlin.
"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," the team wrote in its message.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems."
Those two subsystems handled the indexing of objects stored on S3 and the allocation of new storage. With both of them down, Amazon said it was unable to handle customer requests to S3 itself, or requests from services such as EC2 and Lambda that depend on S3.
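A standard defense against this kind of fat-fingered input is to have the removal tool itself refuse requests that would drop a fleet below a safe capacity floor. The sketch below is purely illustrative: the function name, fleet sizes, and 90% threshold are invented for the example and are not Amazon's actual tooling.

```python
# Hypothetical capacity-removal guard; all names and thresholds here are
# illustrative assumptions, not Amazon's real playbook or tooling.

MIN_CAPACITY_FRACTION = 0.9  # never let a subsystem fall below 90% of its fleet


def remove_servers(fleet_size: int, requested: int) -> int:
    """Validate a removal request before acting on it.

    Returns the number of servers approved for removal, or raises
    ValueError if the request would leave too little capacity behind.
    """
    if requested < 0:
        raise ValueError("removal count must be non-negative")
    remaining = fleet_size - requested
    if remaining < fleet_size * MIN_CAPACITY_FRACTION:
        raise ValueError(
            f"refusing to remove {requested} of {fleet_size} servers: "
            f"only {remaining} would remain, below the safety floor"
        )
    return requested


# A small, intended removal passes; a mistyped larger one is rejected
# instead of silently taking down dependent subsystems.
print(remove_servers(1000, 5))
try:
    remove_servers(1000, 500)  # e.g. an extra digit typed by mistake
except ValueError as e:
    print("blocked:", e)
```

The point of the design is that the check lives in the tool rather than in the operator's head, so a typo produces an error message instead of an outage.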
As a result, websites small and large that relied on the cheap and popular Virginia US-East-1 region stopped working properly, costing hundreds of millions of dollars in losses for customers. It also broke smartphone apps and Internet of Things gadgets – from lightbulbs to Nest security cameras – that were relying on the S3 storage backend.
Do any of our Soylentils here use Amazon S3 and if so, were you impacted by the outage?
(Score: 2) by tynin on Sunday March 05 2017, @04:18AM (1 child)
• Redundant, uninterruptible power supplies, backed up by generators.
As someone who has played in the datacenter world for nearly two decades, I've worked in more than one DC that had power feeds from two different locations, massive rooms with rows and rows of batteries two stories high, on-site generators to carry the load, and a building-wide backup generator.
The number of times the ATS (automatic transfer switch) didn't fail over to the working power feed, or the batteries faulted due to wiring or simply tripped and started discharging while cutting us off from working power, or the on-site generator failed to kick on while the building-wide generator happened to have a fuel filter issue... it boggles my mind. It is like gremlins are real, and their evil shenanigans are not to be stopped.
(Score: 1) by toddestan on Sunday March 05 2017, @06:21AM
Along the same lines, I've found that cheap home/office UPSes can actually decrease your reliability. In other words, problems caused by the cheap UPS will cause more interruptions than simply plugging into the wall and relying on the electric company. Of course, that all depends on how reliable your electric service is.