SoylentNews is people

posted by takyon on Tuesday August 09 2016, @03:34PM
from the you're-grounded dept.

Cringely speculates like hell:

Delta Airlines last night suffered a major power outage at its data center in Atlanta that led to a systemwide shutdown of its computer network, stranding airliners and canceling flights all over the world. You already know that. What you may not know, however, is the likely role in the crisis of IT outsourcing and offshoring.

Do any Soylentils have inside/better information?


Original Submission

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Informative) by Anonymous Coward on Tuesday August 09 2016, @05:53PM

    by Anonymous Coward on Tuesday August 09 2016, @05:53PM (#385875)

    This is what I wrote yesterday on the other site.

    Having had my hands in designing a few of these sorts of 'never go down' systems, I can say it is not as 'easy' as slapping a few bits of cisco/hp/dell kit together and calling it a day.

    First off you need a minimum of 2x the floor space, in a minimum of 2 very different geographic locations.
    Second you need a minimum of 2x the hardware at both locations. Oh and make sure your DCs have fail over power and separate power systems.
    You need 2x the number of people running it. 1 set for each location. Support 'can' remote in to each location. Onsite is preferable; remote 'can' work but it wears out your support staff.
    Next you need to design the system to be able to handle what we called 'split brain', where half your data is coming out of the wrong data stores and the system is cross pointing to the wrong data centers. That takes time and proper design of the software, hardware, and network infrastructure.
    Your software and QA guys need to have their own sets to play with to make sure it works. Preferably two sets each. They do not have to be geo redundant. Just virtually redundant, to simulate it.
    Oh and your external network better be able to handle it. So that means playing with the proper providers of ISP networks, AND managing them and holding them accountable for fuckups. Plan on them not being up to the task, so you have to be redundant.
    Don't forget your upgrade plans. How will you fail over between systems while you upgrade in place (hardware, firmware, and software)? Oh and your QA should be testing that plan as well.
    Also, to your end customers and employees? It has to look totally transparent. So you better have decent network guys and load balancer guys.
    Also be sure to *TEST* your fail over systems. Do they actually fail over? Do they actually come back up? Do they actually get the right data from the right place? It is not enough just to have them. You want to make sure all of your plans actually work. You can even do it during the day, instead of 3AM on a Sunday in a massive conference call. Oh and does the system fail properly while people are using it?
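
    To make those three test questions concrete, here is a minimal sketch of a fail over drill in Python. Everything in it is an assumption for illustration: Site, Cluster, failover_drill and the booking keys are hypothetical names, and the 'data centers' are just in-memory dicts. It is not how Delta or any real product does it, only the shape of the check.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical stand-in for one data center."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)


class Cluster:
    """Toy two-site setup: one active site, one standby kept in sync."""

    def __init__(self, primary: Site, standby: Site):
        self.sites = [primary, standby]
        self.active = primary

    def write(self, key, value):
        if not self.active.up:
            raise RuntimeError("active site is down")
        # Replicate synchronously to every site that is currently up.
        for site in self.sites:
            if site.up:
                site.data[key] = value

    def read(self, key):
        if not self.active.up:
            raise RuntimeError("active site is down")
        return self.active.data[key]

    def fail_over(self) -> Site:
        """Promote the first healthy site that is not the active one."""
        for site in self.sites:
            if site.up and site is not self.active:
                self.active = site
                return site
        raise RuntimeError("no healthy site to fail over to")

    def recover(self, site: Site):
        """Bring a site back and resync it from the active site."""
        site.data = dict(self.active.data)
        site.up = True


def failover_drill(cluster: Cluster) -> dict:
    """Run the three checks from the comment and report pass/fail."""
    results = {}
    old = cluster.active
    cluster.write("booking-1", "ATL->JFK")

    old.up = False  # simulate the outage at the active site
    new = cluster.fail_over()
    results["fails_over"] = (
        new is not old and cluster.read("booking-1") == "ATL->JFK"
    )

    # Traffic keeps arriving while the old site is down.
    cluster.write("booking-2", "JFK->LAX")
    cluster.recover(old)
    results["comes_back"] = old.up

    # The recovered site must hold everything written while it was down.
    results["right_data"] = old.data == new.data
    return results


if __name__ == "__main__":
    print(failover_drill(Cluster(Site("atlanta"), Site("remote"))))
```

    The point of returning a dict instead of just asserting is that a drill like this can run on a schedule, during the day, and report which of the three questions failed.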

    Last of all you *need* and *want* to make sure your VP and up management is 100% on board. If they are not, *none* of the junk above matters. Don't bother, and find another job, as it will never be funded correctly.

    It takes a fairly seasoned hand to build systems like this. Your fresh off the plane (hehe) h1b probably will not cut it no matter how much smoke the temp agency blows up your ass. You can train them in place. But expect it to take time. Last time I did this it took about 2 months to put the hardware together (and it was a smallish system of 40 or so racks). It took another year and a half to work out all the bugs and procedures. Who gets called when. Who replaces what when. How is software upgraded. How does QA sign off on it etc.

    Some of the newer techs like docker, vmware, nosql can help mitigate some of these issues. But not all of them. You need to test them and find the holes. So you can either mitigate them or minimize them.

    Also you can outsource, but remember they don't 'own' it. You do. All they care about is getting the contract complete. That does not mean a working viable system.

    That is the sort of system you want to build for a thing like this. Your customers and your fellow employees *expect* it to 'just work'.

    I feel for the dudes at that Atlanta data center. Return to service is just the first step. "Does not happen again" is the next step, and it takes a lot of humility and fortitude to make that happen. It also means not hiding things that are wrong and being a right bastard to 'fix it'.

  • (Score: 2) by Thexalon on Tuesday August 09 2016, @07:13PM

    by Thexalon (636) on Tuesday August 09 2016, @07:13PM (#385916)

    I feel for the dudes at that atlanta data center. Return to service is just the first step. "does not happen again" is the next step and that takes a lot of humility and fortitude to make it happen. It also means not hiding things that are wrong and being a right bastard to 'fix it'.

    That presumes, of course, that they really care. If they don't really care, they'll find an unpopular junior admin, blame the whole thing on that one person, fire them, and tell upper management that the root cause was that hapless schlemazel and is now fixed.

    --
    The only thing that stops a bad guy with a compiler is a good guy with a compiler.
  • (Score: 3, Insightful) by darkfeline on Wednesday August 10 2016, @04:53AM

    by darkfeline (1030) on Wednesday August 10 2016, @04:53AM (#386125)

    I guarantee you Delta can afford to pay for it. Take a chunk out of the CEO's umbrella.

    --
    Join the SDF Public Access UNIX System today!
    • (Score: 1) by redneckmother on Wednesday August 10 2016, @05:13AM

      by redneckmother (3597) on Wednesday August 10 2016, @05:13AM (#386129)

      Sorry - bad moderation... showed up as "spam", but intended as "insightful". Dunno why that happened... Admins?

      --
      Mas cerveza por favor.
      • (Score: 2) by The Mighty Buzzard on Wednesday August 10 2016, @10:09AM

        by The Mighty Buzzard (18) <themightybuzzard@proton.me> on Wednesday August 10 2016, @10:09AM (#386201)

        Hit End or Page Down while the moderation dropdown has focus and it goes to the bottom of the list, Spam. Taken care of.

        --
        My rights don't end where your fear begins.
        • (Score: 0) by Anonymous Coward on Wednesday August 10 2016, @12:19PM

          by Anonymous Coward on Wednesday August 10 2016, @12:19PM (#386228)

          Maybe you could add another dashed line below "spam"? It should be a bit more difficult to accidentally pick :)