SoylentNews is people

posted by janrinok on Wednesday January 05, @12:41PM
from the BIG-oops! dept.

This HPE software update accidentally wiped 77TB of data:

We covered this story here: University Loses Valuable Supercomputer Research After Backup Error Wipes 77 Terabytes of Data. I, like some others, suspected finger trouble on the part of those doing the backup, but the company that wrote the software has put its hands up and taken responsibility.

A flawed update sent out by Hewlett Packard Enterprise (HPE) resulted in the loss of 77TB of critical research data at Kyoto University, the company has admitted.

HPE recently issued a software update that broke a program deleting old log files, and instead of just deleting those (which would still have a backup copy stored in a high-capacity storage system), it deleted pretty much everything, including files in the backup system, Tom's Hardware reported.

As a result, some 34 million files, generated by 14 different research groups, from December 14 to December 16, were permanently lost.

In a press release, issued in Japanese, HPE took full responsibility for the disastrous mishap.
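
The failure described above, a cleanup job meant to remove only old log files that ended up deleting far more, is the kind of thing a few defensive checks can make less likely. The following is a minimal sketch in Python and not HPE's actual script, whose details are not given here. The function names `find_old_logs` and `purge_old_logs`, the `*.log` pattern, and the dry-run default are all assumptions made for illustration.

```python
import time
from pathlib import Path


def find_old_logs(log_dir, max_age_days, now=None):
    """Return *.log files directly under log_dir older than max_age_days."""
    # An empty path would silently resolve to the current directory,
    # so refuse it instead of guessing what the caller meant.
    if not log_dir or not str(log_dir).strip():
        raise ValueError("log_dir is empty; refusing to scan")
    root = Path(log_dir).resolve()
    # Never operate on the filesystem root.
    if root == Path(root.anchor):
        raise ValueError("log_dir resolves to the filesystem root; refusing to scan")
    if not root.is_dir():
        raise ValueError(f"{root} is not a directory")
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    # Only regular files matching the pattern, no recursion, no symlinks.
    return sorted(
        p for p in root.glob("*.log")
        if p.is_file() and not p.is_symlink() and p.stat().st_mtime < cutoff
    )


def purge_old_logs(log_dir, max_age_days, dry_run=True, now=None):
    """Delete old logs; with dry_run=True only report what would be removed."""
    victims = find_old_logs(log_dir, max_age_days, now=now)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims
```

Calling `purge_old_logs("/some/log/dir", 10)` only lists candidates; deletion requires an explicit `dry_run=False`. The point of the design is that a missing or empty directory argument raises an error rather than widening the scope of the delete.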


Original Submission

Related Stories

University Loses Valuable Supercomputer Research After Backup Error Wipes 77 Terabytes of Data 24 comments

University Loses Valuable Supercomputer Research After Backup Error Wipes 77 Terabytes of Data:

Kyoto University, a top research institute in Japan, recently lost a whole bunch of research after its supercomputer system accidentally wiped out a whopping 77 terabytes of data during what was supposed to be a routine backup procedure.

That malfunction, which occurred sometime between Dec. 14 and Dec. 16, erased approximately 34 million files belonging to 14 different research groups that had been using the school's supercomputing system. The university operates Hewlett Packard Cray computing systems and a DataDirect ExaScaler storage system—the likes of which can be utilized by research teams for various purposes.

It's unclear what kind of files were specifically deleted or what caused the actual malfunction, though the school has said that the work of at least four different groups will not be able to be restored.

Also at BleepingComputer.

Original announcement from the university.


Original Submission

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2, Interesting) by Anonymous Coward on Wednesday January 05, @12:46PM (5 children)

    by Anonymous Coward on Wednesday January 05, @12:46PM (#1210094)

    How did that get past their lawyers and executives?

    • (Score: 5, Funny) by driverless on Wednesday January 05, @01:35PM (2 children)

      by driverless (4770) on Wednesday January 05, @01:35PM (#1210099)

      It's Japan. Not only did they issue a public apology but several HP managers involved in the problem committed seppuku shortly afterwards.

      • (Score: 0) by Anonymous Coward on Wednesday January 05, @05:39PM

        by Anonymous Coward on Wednesday January 05, @05:39PM (#1210184)

        with help.

      • (Score: 2) by krishnoid on Wednesday January 05, @06:06PM

        by krishnoid (1156) on Wednesday January 05, @06:06PM (#1210194)

        If that's the case, who was their second (source)?

    • (Score: 2) by Runaway1956 on Wednesday January 05, @01:35PM

      by Runaway1956 (2926) Subscriber Badge on Wednesday January 05, @01:35PM (#1210101) Homepage Journal

      It's okay - they fired the cleaning lady responsible, and she committed hara-kiri, absolving the company of any dishonor. It's going to work out.

      --
      “If everyone is thinking alike, then somebody isn't thinking.” ― George S. Patton on Ukraine
    • (Score: 0) by Anonymous Coward on Wednesday January 05, @01:39PM

      by Anonymous Coward on Wednesday January 05, @01:39PM (#1210103)

      Perhaps some things you see as universal truths are in fact cultural truths that don't work out the same way in a very different culture.

  • (Score: 5, Informative) by epitaxial on Wednesday January 05, @02:46PM (3 children)

    by epitaxial (3165) on Wednesday January 05, @02:46PM (#1210121)

    From the same people who brought you drives that shit the bed when they roll over 32768 hours. https://www.bleepingcomputer.com/news/hardware/hp-warns-that-some-ssd-drives-will-fail-at-32-768-hours-of-use/ [bleepingcomputer.com]

    • (Score: 5, Funny) by maxwell demon on Wednesday January 05, @05:59PM (2 children)

      by maxwell demon (1608) on Wednesday January 05, @05:59PM (#1210190) Journal

      So basically, HPE is short for Huge Problems Expected?

      --
      The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 0) by Anonymous Coward on Wednesday January 05, @10:59PM

        by Anonymous Coward on Wednesday January 05, @10:59PM (#1210323)

        Or perhaps High Performance Erasure

      • (Score: 0) by Anonymous Coward on Friday January 07, @09:33AM

        by Anonymous Coward on Friday January 07, @09:33AM (#1210797)

        Around here, we say it stands for "High Priced Equipment" because they are almost always in the top 2 for hardware cost when bidding.

  • (Score: 3, Interesting) by MrGuy on Wednesday January 05, @03:07PM (16 children)

    by MrGuy (1007) on Wednesday January 05, @03:07PM (#1210129)

    some 34 million files,

    Wow that’s a lot of files.

    generated by 14 different research groups,

    Boy that’s a significant number of folks

    from December 14 to December 16

    Wait, what? They lost two days worth of files?

    Sure. That’s annoying. It’s definitely a setback. And it absolutely shouldn’t have happened.

    But we’re not talking about years of work lost. Not months. Not even weeks. It’s two days.

    The headlines on this incident would have you believe this is the burning of the Lighthouse of Alexandria or the Sack of Baghdad level cultural loss. They throw up the big numbers to get the big clicks, while ignoring that this is an INCONVENIENCE. Not a tragedy.

    • (Score: 3, Funny) by MrGuy on Wednesday January 05, @03:16PM

      by MrGuy (1007) on Wednesday January 05, @03:16PM (#1210134)

      Sorry. Got my Civ wonders mixed up there. Obviously, meant the Great Library of Alexandria, not the lighthouse.

    • (Score: 3, Insightful) by Runaway1956 on Wednesday January 05, @03:29PM (7 children)

      by Runaway1956 (2926) Subscriber Badge on Wednesday January 05, @03:29PM (#1210138) Homepage Journal

      I'm not sure about your math, or your line of reasoning. It sounds like they not only lost all the work done on those particular days, but they lost all the backups of work leading up to those days? The article is not especially clear exactly what was lost, but if all the backups were erased, it's probably a lot more work than just a couple of days.

      So, I guess it all boils down to a rather simple question: How much time, money, and manpower will it take to replace and reproduce all the erased data? If they can "fix" the problem in less than a month, then not a really big deal. If the "fix" will take years, then it's a major problem. Let's remember that some research projects take years of planning, and funding from the university, from private enterprise, as well as government grants.

      --
      “If everyone is thinking alike, then somebody isn't thinking.” ― George S. Patton on Ukraine
      • (Score: 2) by krishnoid on Wednesday January 05, @06:14PM

        by krishnoid (1156) on Wednesday January 05, @06:14PM (#1210196)

        You're way ahead of me on this. I'm still trying to puzzle together "backup" and "erased". Like, "whoops, we reversed a flag on rsync" plus "we totally (accidentally) called a separate routine to overwrite old backups, some of which required operator action to remount".

      • (Score: 0) by Anonymous Coward on Wednesday January 05, @09:08PM (5 children)

        by Anonymous Coward on Wednesday January 05, @09:08PM (#1210271)

        Whatever, real science is always reproducible, so how can anything of value have been lost here?

        • (Score: 3, Insightful) by Kell on Wednesday January 05, @10:36PM (4 children)

          by Kell (292) on Wednesday January 05, @10:36PM (#1210320)

          Speaking as a research engineer: time. Time is valuable and you never get it back.

          --
          Scientists ask questions. Engineers solve problems.
          • (Score: 1) by Acabatag on Thursday January 06, @02:59AM (3 children)

            by Acabatag (2885) on Thursday January 06, @02:59AM (#1210402)

            What seems odd to me is that scientists let anybody from Eye-Tee with their crappy 'enterprise' systems near this much important scientific data. Weren't there toner cartridges to replace or a secretary's keyboard to blow the crumbs out of?

            This was not ordinary data that the data janitors are charged with maintaining.

            • (Score: 3, Informative) by Kell on Thursday January 06, @06:23AM (2 children)

              by Kell (292) on Thursday January 06, @06:23AM (#1210459)

              This might surprise some, but we often don't have a choice how our data is hosted at our institutions. With various data integrity and handling mandates from funding agencies plus strictly limited grant budgets it's almost impossible to self-manage your own IT unless you're somewhere like CERN. At my institution even getting an Octoprint server that I can connect to from home to manage 3D prints is basically impossible because the IT people are funded independently of the academics who rely on them and simply don't give a shit.

              --
              Scientists ask questions. Engineers solve problems.
              • (Score: 2) by PiMuNu on Thursday January 06, @04:37PM

                by PiMuNu (3823) Subscriber Badge on Thursday January 06, @04:37PM (#1210553)

                I think even at CERN they have IT guys running the firewall/etc. (Although not as long ago as you might imagine, they had a bit of an "incident" when it turned out all of their controls software shared the same password)

                The folks running the cluster are probably more like research specialists into IT, but they are surely going to use some enterprise solution for managing many storage nodes rather than rolling their own hacky scripts.

              • (Score: 0) by Anonymous Coward on Saturday January 08, @05:29AM

                by Anonymous Coward on Saturday January 08, @05:29AM (#1211018)

                My only catch on this is that the affected storage was for their 3 different supercomputers. At all the places I've worked at and with, the long-term storage is located separately from the cluster storage. Furthermore, there is absolutely the expectation everywhere I have been or worked with to use both for their intended purposes.

    • (Score: 3, Interesting) by tangomargarine on Wednesday January 05, @04:06PM (6 children)

      by tangomargarine (667) on Wednesday January 05, @04:06PM (#1210151)

      some 34 million files,

      Yeah, like how they used to say "computer capable of storing ten million pieces of information" "so it's got a 10 MB hard drive"

      Wait, what? They lost two days worth of files?

      Sure. That’s annoying. It’s definitely a setback. And it absolutely shouldn’t have happened.

      But we’re not talking about years of work lost. Not months. Not even weeks. It’s two days.

      The headlines on this incident would have you believe this is the burning of the Lighthouse of Alexandria or the Sack of Baghdad level cultural loss.

      Depends on the context. My sister was doing experiments with mice for her PhD and she said that if something happened to her mice before the experiment was done, it would cost $5000 a mouse instead of $50 to replace them and start over, because the specific genetic breed she was using was no longer in demand from the breeding facility.

      Even aside from missing an obviously critical event ("we were studying the build-up and actual event of this supernova and we've still got the last 3 months but we lost the initial 2 days of the nova itself"), it's always possible that the data as a whole is worse without a continuous sampling.

      And it's 14 different research groups. So the odds are higher at least 1 of them was significantly impacted.

      The headlines on this incident would have you believe this is the burning of the Lighthouse of Alexandria or the Sack of Baghdad level cultural loss.

      Which of these headlines is implying that?

      This HPE software update accidentally wiped 77TB of data

      University Loses Valuable Supercomputer Research After Backup Error Wipes 77 Terabytes of Data

      "Valuable" is hardly comparing anything to the Library of Alexandria, and "77TB" is just a fact. It sounds like *you're* the one overreacting.

      --
      "Is that really true?" "I just spent the last hour telling you to think for yourself! Didn't you hear anything I said?"
      • (Score: 2) by maxwell demon on Wednesday January 05, @06:04PM

        by maxwell demon (1608) on Wednesday January 05, @06:04PM (#1210192) Journal

        "computer capable of storing ten million pieces of information" "so it's got a 10 MB hard drive"

        A piece of information could be a bit, in which case it would be more of a 1.44 MB floppy.

        --
        The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 2) by krishnoid on Wednesday January 05, @06:29PM (3 children)

        by krishnoid (1156) on Wednesday January 05, @06:29PM (#1210201)

        Average mass of adult mouse: 30g (on the heavier side)
        Cost of one gram of gold: ~$60
        Replacing one mouse: $5e3 / (30g x $60/g) = $5e3 / $1800 ~= 2.78 times its mass in gold.

        Sounds like a great investment. In a research lab though, would you buy mouse futures or insurance? It's not like they live that long, so you'd figure they'd either rebreed their own or pay the research breeder a little more to keep a line open for a couple years. I don't know how these "lines" are kept "open" when it comes to rodent husbandry, though.

        • (Score: 2) by tangomargarine on Wednesday January 05, @07:33PM (2 children)

          by tangomargarine (667) on Wednesday January 05, @07:33PM (#1210217)

          My numbers may not be accurate, but it was at least an order of magnitude.

          I think the issue is, if they have an active breeding facility at the supplier, all they need to do is dip into the cage to get some mice out, and ship them to you. When there isn't enough demand for a genetic line, they freeze some of its DNA for storage...so if you want that same strain again, they have to use petri dishes to "spin up" the breeding population again before it becomes self-sustaining, which will take longer and be more inconvenient for them.

          I think the strain she was working on may have affected the fertility of the mice involved, too.

          --
          "Is that really true?" "I just spent the last hour telling you to think for yourself! Didn't you hear anything I said?"
          • (Score: 2) by krishnoid on Wednesday January 05, @07:42PM

            by krishnoid (1156) on Wednesday January 05, @07:42PM (#1210224)

            Are you serious? Seems like a great B-story premise for a science-fiction epic, with humans instead of mice.

          • (Score: 1, Informative) by Anonymous Coward on Wednesday January 05, @11:04PM

            by Anonymous Coward on Wednesday January 05, @11:04PM (#1210325)

            That is about right. Getting a specific mouse line repeated can easily cost multiple orders of magnitude more than ordering them in the first place. The suppliers can easily get you a more popular one or the ones they have on hand, but if you go outside of that, oh boy, can it get expensive. If they have to dip into their embryo stock, replicate the growth environment, and do all the other work that goes into creating a mouse line, and they need to do it in a rush so your experiments aren't delayed/corrupted, and they don't have a large number of orders to spread out the cost, it adds up really quickly per mouse.

      • (Score: 0) by Anonymous Coward on Friday January 07, @09:44AM

        by Anonymous Coward on Friday January 07, @09:44AM (#1210801)

        I have experience with this. When I was working on a project with a free electron laser, our team got four nights to use the beam, and boy, it would have been a shame to lose data from very expensive research equipment because some bug wiped it out. LHC maintenance at CERN is $3000000 a day on average, so two days might be very costly at research institutions.

  • (Score: 2) by Mojibake Tengu on Wednesday January 05, @04:23PM (1 child)

    by Mojibake Tengu (8598) Subscriber Badge on Wednesday January 05, @04:23PM (#1210161) Journal

    In a press release, issued in Japanese, HPE took full responsibility for the disastrous mishap.

    Now that's interesting news. I wish they did. Really. I'd gladly watch the ceremony.

    --
    The edge of 太玄 cannot be defined, for it is beyond every aspect of design
    • (Score: 2) by krishnoid on Wednesday January 05, @06:37PM

      by krishnoid (1156) on Wednesday January 05, @06:37PM (#1210204)

      Hell, I'd watch it even if it was symbolically simulated in computer graphics. Still gets a strong visual point across.

  • (Score: 3, Funny) by inertnet on Wednesday January 05, @10:35PM (1 child)

    by inertnet (4071) Subscriber Badge on Wednesday January 05, @10:35PM (#1210318)

    I'm surprised that apparently they didn't attempt to undelete whatever they could. I'm also surprised that such a high end backup system doesn't appear to have any undelete features.

    • (Score: 0) by Anonymous Coward on Wednesday January 05, @11:11PM

      by Anonymous Coward on Wednesday January 05, @11:11PM (#1210330)

      A high end backup system is a complex bunch of software. A bug free one should give you pretty good safety. Expecting/needing both bug free and complex in a place it doesn't have to be doesn't seem a good idea.

      Aside from the additional problem of being an especially juicy ransom target, offline and simple seems safer.

  • (Score: 3, Funny) by jasassin on Thursday January 06, @02:17AM

    by jasassin (3566) <jasassin@gmail.com> on Thursday January 06, @02:17AM (#1210395) Journal

    I thought I felt sick after losing my Final Fantasy 3 save… Sheesh.

    --
    jasassin@gmail.com Key fingerprint = 0644 173D 8EED AB73 C2A6 B363 8A70 579B B6A7 02CA