
posted by CoolHand on Monday September 12 2016, @01:32PM
from the booms-and-bangs dept.

https://motherboard.vice.com/read/a-loud-sound-just-shut-down-a-banks-data-center-for-10-hours

ING Bank's main data center in Bucharest, Romania, was severely damaged over the weekend during a fire extinguishing test. In what is a very rare but known phenomenon, it was the loud sound of inert gas being released that destroyed dozens of hard drives. The site is currently offline, and the bank is relying solely on its backup data center, located a couple of miles away.

"The drill went as designed, but we had collateral damage", ING's spokeswoman in Romania told me, confirming the inert gas issue. Local clients were unable to use debit cards and to perform online banking operations on Saturday between 1PM and 11PM because of the test. "Our team is investigating the incident," she said.

The purpose of the drill was to see how the data center's fire suppression system worked. Data centers typically rely on inert gas to protect the equipment in the event of a fire, as the substance does not chemically damage electronics, and the gas only slightly decreases the temperature within the data center.


Original Submission

 
  • (Score: 2) by MrGuy on Monday September 12 2016, @05:51PM

    by MrGuy (1007) on Monday September 12 2016, @05:51PM (#400842)

    You have a primary datacenter. You have a nearby backup datacenter. You're going to simulate a disaster at the primary data center.

    You don't switch active operations over to the secondary data center FIRST? You wait until AFTER something goes wrong with your test to move operations to the redundant center?

    I'm sure they didn't expect anything to fail when they did the test, but that's why you do the test - you're doing it in case something unexpected breaks. That's the whole point. If you knew for certain what would happen, there would be no point in doing the test. And you have a redundant data center available that can handle the load.

    That seems like disaster recovery testing 101 to me.

    If the thinking is that their switchover process between datacenters is risky, or that the standby center might not be able to handle the load, well, guess what - you've identified a far bigger risk to your continuing operations than anything the fire system test was likely to teach you. And, by the way, if your failover process is problematic, you're WAY better off failing over late the night before the test than having to do it "live" when all your processing is down and people are screaming at you to bring it back up, which is undoubtedly what was happening here... (A rough sketch of that "fail over first, then test" ordering follows below.)

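For what it's worth, the "fail over first, then run the drill" ordering described in the comment above could be sketched roughly like this. Every site name and helper function here is hypothetical; the real health checks and switchover mechanism would depend on ING's actual setup, which the article does not describe.

import sys
import time

PRIMARY = "dc-bucharest-1"   # hypothetical name for the site under test
STANDBY = "dc-bucharest-2"   # hypothetical name for the nearby backup site

def replication_lag_seconds(site):
    """Stand-in for a real replication-lag probe against `site`."""
    return 0.5

def healthy(site):
    """Stand-in for a real end-to-end health check (cards, online banking)."""
    return True

def route_traffic_to(site):
    """Stand-in for the real switchover (DNS, load balancer, or routing change)."""
    print("routing production traffic to", site)

def controlled_failover():
    # 1. Confirm the standby can actually take the load before touching anything.
    if not healthy(STANDBY):
        sys.exit("standby unhealthy -- abort the drill; nothing is at risk yet")

    # 2. Wait for replication to catch up so no transactions are stranded.
    while replication_lag_seconds(STANDBY) > 1.0:
        time.sleep(5)

    # 3. Move live traffic off the primary the night before the test.
    route_traffic_to(STANDBY)

    # 4. Only now is the primary safe to subject to the suppression drill;
    #    if drives die, customers never notice.
    print(PRIMARY, "is drained -- fire-suppression drill can proceed")

if __name__ == "__main__":
    controlled_failover()

The point is simply the ordering: drain the primary under controlled conditions before the destructive test, so a surprise like the inert-gas noise only costs hardware, not uptime.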
  • (Score: 2) by gnampff on Monday September 12 2016, @06:15PM

    by gnampff (5658) on Monday September 12 2016, @06:15PM (#400849)

    Of course they don't switch over to the fallback before the test. What is the point of a fallback if you can't use it as a fallback AFTER the crash? It is sitting there for exactly this purpose. And obviously their procedures were crap and need some tuning so their secondary system actually does what it is supposed to do in an emergency.

  • (Score: 3, Funny) by VLM on Monday September 12 2016, @07:51PM

    by VLM (445) on Monday September 12 2016, @07:51PM (#400898)

    when all your processing is down and people are screaming at you to bring it back up, which is undoubtedly what was happening here

    An interesting story from the same dino pen that had a halon dump a quarter century ago (unrelated, however): one of our clients demanded a disaster recovery test during a hurricane, which pissed me off as a WAN operator because, hey, I'm busy with the REAL hurricane knocking out REAL circuits, not your made-up test. But I was informed by my counterpart that if anything went wrong (and nothing did go wrong), they would blame the real-world hurricane for any impact. I'm like "doncha know we're in the midwest not far from Chicago and the hurricane is 2000 miles away?" and all they say is "sshhhhhhh!"

    It is a stroke of brilliance which I've never gotten any later employer to implement.

    Our customer either tells their customers "Hey we did a disaster recovery test during hurricane WTF and everything went fine, we friggin rock" or "Oh so sorry about that outage but you know hurricane WTF was a severe storm and all that". Either way they win. Genius I tell you, genius...

    And to think of all the times in the last quarter century I've upgraded IOS or made config changes at 2am or whatever, when I could have just waited for the next thunderstorm or blizzard.