
SoylentNews is people

posted on Saturday March 04 2017, @04:56AM
from the who-among-us-can-cast-the-first-stone? dept.

Over at The Register, Shaun Nichols has a cheeky take on what happened to Amazon S3 last week:

Amazon has provided the postmortem for Tuesday's AWS S3 meltdown, shedding light on what caused one of its largest cloud facilities to bring a chunk of the web down.

In a note today to customers, the tech giant said the storage system was knocked offline by a staffer trying to address a problem with its billing system. Essentially, someone mistyped a command within a production environment while debugging a performance gremlin.

"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," the team wrote in its message.

"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems."

Those two subsystems handled the indexing for objects stored on S3 and the allocation of new storage instances. Without these two systems operating, Amazon said it was unable to handle any customer requests for S3 itself, or those from services like EC2 and Lambda functions connected to S3.

As a result, websites small and large that relied on the cheap and popular Virginia US-East-1 region stopped working properly, costing hundreds of millions of dollars in losses for customers. It also broke smartphone apps and Internet of Things gadgets – from lightbulbs to Nest security cameras – that were relying on the S3 storage backend.
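
To illustrate the failure mode described in the postmortem, where one mistyped input removed far more servers than intended, here is a minimal sketch of a guard that a capacity-removal tool could apply before acting. This is not Amazon's tooling: the function `plan_removal`, the parameters `max_fraction` and `min_remaining`, and the server names are all hypothetical.

```python
def plan_removal(active, requested, max_fraction=0.05, min_remaining=1):
    """Return the servers to remove, or raise if the request looks unsafe.

    Hypothetical safeguard: thresholds and names are illustrative only.
    """
    active_set = set(active)
    # De-duplicate while keeping the order the operator typed.
    targets = list(dict.fromkeys(requested))

    # Refuse names that are not in the active fleet (likely a typo).
    unknown = [s for s in targets if s not in active_set]
    if unknown:
        raise ValueError(f"unknown servers: {unknown}")

    # Refuse to remove more than a small fraction of the fleet in one command.
    limit = int(len(active_set) * max_fraction)
    if len(targets) > limit:
        raise ValueError(f"refusing to remove {len(targets)} servers; limit is {limit}")

    # Refuse to drop the subsystem below its minimum capacity.
    if len(active_set) - len(targets) < min_remaining:
        raise ValueError("removal would leave the subsystem below minimum capacity")

    return targets


fleet = [f"idx-{i:03d}" for i in range(100)]
print(plan_removal(fleet, ["idx-001", "idx-002"]))
```

With a guard like this, a request for two servers passes, while an input that expands to half the fleet is rejected before anything is removed.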

Do any of our Soylentils here use Amazon S3 and, if so, were you impacted by the outage?


  • (Score: 0) by Anonymous Coward on Saturday March 04 2017, @03:38PM (3 children)

    by Anonymous Coward on Saturday March 04 2017, @03:38PM (#474893)
    If you or someone you know is a student at College of the Redwoods, then you may have heard that their unsinkable cloud-based Canvas application was down yesterday, Tuesday, the last day of February - impacting thousands of college students.

    The problem, it is said, was traced to Amazon Web Services' S3-based storage service, specifically, a storage facility that served the entire East Coast, located somewhere in the Southeast - Georgia, I think.

    Lesson #1: The Canvas service provider did not geolocate resources, i.e., if they used Amazon's S3 service then they should have selected a West Coast storage facility. The service provider put profits ahead of service.

    Lesson #2: The Canvas service provider did not mirror resources, i.e., if they used Amazon's S3 service - which offers high availability, but charges extra - then they must have been using the very cheapest service available from S3. Again, the service provider put profits ahead of service.

    Lesson #3: The Canvas service provider did not use "the cloud" correctly. In theory a cloud-based service cannot be interrupted. In practice, we see that it can.

    Lesson #4: Someone in the food chain of command was not very diligent. Where your entire business depends upon a specific service being available, that is a dependency, and in my opinion anything you depend upon should be sited locally where you can see it and take responsibility for it - not off in 'the cloud' somewhere where you have no idea where it is or how to fix it and can only call a support number that, odds are, connects you to someone in India, where all you can do is demand a refund. No, thanks!
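
The mirroring idea in Lessons #1 and #2 could be sketched roughly as follows: keep a copy of each object in more than one region and fall back to the next copy when a read fails. This is only an illustration under stated assumptions. The function `read_with_fallback` and the two stand-in readers are hypothetical and do not call any real AWS API; in practice each reader would wrap a client for one region.

```python
from typing import Callable, Sequence


def read_with_fallback(key: str, readers: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each regional reader in order and return the first successful result."""
    errors = []
    for read in readers:
        try:
            return read(key)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(readers)} replicas failed for {key!r}: {errors}")


# Hypothetical stand-ins for per-region storage clients.
def east_down(key: str) -> bytes:
    raise ConnectionError("east region unavailable")


def west_ok(key: str) -> bytes:
    return b"course-material:" + key.encode()


print(read_with_fallback("syllabus.pdf", [east_down, west_ok]))
```

The trade-off the comment points at is cost: the second copy has to be stored and kept in sync, which is why a provider might skip it.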

    College of the Redwoods may not be a business - but it is an economic venture that is required to add value to its students' lives, and to the community that hosts it - and when the value that it adds is dependent upon a service that is hosted in a cloud somewhere on the East Coast, we have gone badly off course.


    How does one make a cloud-based service un-interruptible? The same way one makes any other information technology infrastructure un-interruptible:

    • Redundant, uninterruptible power supplies, backed up by generators.
    • Redundant networks.
    • Redundant network interfaces.
    • More redundant power supplies in each computer.
    • Redundant CPUs.
    • Redundant disk drives.
    • High availability filesystems.
    • Storage area networks and network-attached storage.

    And all of these services are available at your local data center - or your remote data center - and then, if there's a problem, you are in control - and can decide if it's cost-effective to implement additional measures, yourself. Many data centers will even act as your hands and push buttons and replace parts and follow instructions, so that you don't even need to be there.
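
As a rough way to reason about the redundancy list above, the usual back-of-the-envelope arithmetic treats redundant components as "parallel" (the system is up if any one is up) and chained components as "series" (all must be up). The sketch below assumes failures are independent, which real facilities often violate, and the availability figures are made-up examples rather than measurements.

```python
from math import prod


def series(*avail: float) -> float:
    """Availability when every component must be up."""
    return prod(avail)


def parallel(*avail: float) -> float:
    """Availability when at least one redundant component must be up."""
    return 1 - prod(1 - a for a in avail)


# Illustrative numbers only: two 99% power feeds behind a 99.9% transfer switch.
feeds = parallel(0.99, 0.99)
print(f"two redundant feeds: {feeds:.6f}")
print(f"feeds plus transfer switch: {series(feeds, 0.999):.6f}")
```

Note that the switch joining the redundant feeds is itself a series component, so it caps the availability of everything behind it.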

    What was the REAL cause? That's what I always want to know.

    It seems possible that the Amazon Web Services division found it expedient to lease facilities - maybe entire buildings - at other companies' existing data centers - and then exposed their infrastructure to failure by allowing their infrastructure to depend upon services provided by their host, which failed.

    Equally likely is that AWS, or their host, employed a contract employee instead of a permanent employee and that as a result of the constant churn of employees in the management of the IT infrastructure, some critical bit of information was not relayed to the new employee - one of the unfortunate consequences of today's newest fad, terminating employees via text message.


    So much of this is self-inflicted that it is hard to know where to start in recommending a corrective action.

    But it's clear that there is a screaming lack of good technologists.

    And it's here that I want to bring it back home. I don't want to cloud your understanding of the situation. Just the opposite.

    The people I meet coming out of College of the Redwoods seem kind of lackluster - they are, at best, dissatisfied Microsoft fans, in no small part because that is all they have known for the past few years of education.

    It would be nice if College of the Redwoods had a class on Amazon Web Services ... and another class, on Linux ... and maybe another class, on Linux systems administration ... so that graduates had half a chance of finding employment outside of Humboldt County - the government of which is largely Microsoft-based.

    It would also be nice if the teachers of the software-centric classes were all encouraged to consider making YouTube-grade videos of their lessons, too.

    I say this because the teachers tend to move their mouses very quickly and so it is not always easy to see what they did, leading to the teacher repeating it or the student missing it.

    Many YouTube videos focus upon the screen from the user's perspective - and this, along with the ability to stop, back up, and replay, makes for a superior learning experience.

    I understand how the teachers might be concerned that they will be replaced by a video, and end up unemployed ... but every year there is a new version of software, and you are still needed, to support the students' learning experience, and to extend the courseware - so there will still be quality work crying out to be done, by good teachers.

    I understand how the college might be concerned about their videos being pirated - but they don't have to host the videos on YouTube - they could embed watermarks, host the videos themselves (just don't host it in 'the cloud' :), cultivate the expertise required to run video hosting services locally, maybe help spawn a local business or two ... and make a boatload of money suing the people who pirate the videos, anyway, watermarks and all.


    I think I still have a syllabus kicking around from a class on Linux systems administration, using Red Hat Linux, if there's interest. It would be straightforward to use it as a basis for a newer, up-to-date syllabus on any other Linux, and then add material, as needed.


    (For some strange reason, either Craigslist staff or Humboldt County government took this posting down from the Humboldt County Craigslist server almost immediately after it was posted, and after it was reposted, it was taken down, quickly, again. So we are posting it here - apparently no one in Humboldt County is allowed to be informed on this topic.)
  • (Score: 2, Funny) by Ethanol-fueled on Saturday March 04 2017, @06:07PM

    by Ethanol-fueled (2792) on Saturday March 04 2017, @06:07PM (#474956)

    Get a job, you fucking hippie.

    -- San Diego

  • (Score: 2) by tynin on Sunday March 05 2017, @04:18AM (1 child)

    by tynin (2013) on Sunday March 05 2017, @04:18AM (#475129)

    • Redundant, uninterruptible power supplies, backed up by generators.

    As someone who has played in the datacenter world for nearly two decades, I've worked in more than one DC that had power feeds from two different locations, massive rooms with rows and rows of batteries two stories high, on-site generators to carry the load, and a building-wide backup generator.

    The number of times the ATS (automatic transfer switch) didn't fail over to the working power feed, the batteries faulted due to wiring or simply tripped and started discharging while cutting us off from working power, the on-site generator failed to kick on, and the building-wide generator happened to have a fuel filter issue... it boggles my mind. It is like gremlins are real, and their evil shenanigans are not to be stopped.

    • (Score: 1) by toddestan on Sunday March 05 2017, @06:21AM

      by toddestan (4982) on Sunday March 05 2017, @06:21AM (#475168)

      Along the same lines, I've found that cheap home/office UPSes can also decrease your reliability. In other words, problems caused by the cheap UPS will cause more interruptions than simply plugging into the wall and relying on the electric company. Of course, that all depends on how reliable your electric service is.
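
The point about cheap UPSes can be put into a small, admittedly simplified model: the UPS sits in series with the wall outlet, so its own faults count against you, while it only helps during the fraction of outages its battery can ride through. The function name `availability_with_ups` and all the numbers below are assumptions chosen for illustration, not measurements.

```python
def availability_with_ups(wall: float, ups: float, coverage: float) -> float:
    """Availability of power behind a UPS.

    wall:     fraction of time utility power is up
    ups:      fraction of time the UPS itself is working
    coverage: fraction of utility downtime the battery rides through
    """
    return ups * (wall + (1 - wall) * coverage)


# Reliable utility, cheap UPS: the UPS makes things slightly worse.
print(availability_with_ups(wall=0.9995, ups=0.999, coverage=0.9))
# Flaky utility, same UPS: the UPS is a clear improvement.
print(availability_with_ups(wall=0.99, ups=0.999, coverage=0.9))
```

Under this model the UPS only pays off when the outages it covers outweigh the interruptions it introduces, which matches the caveat about how reliable the electric service is.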