
SoylentNews is people

posted by janrinok on Friday March 27 2020, @09:03AM   Printer-friendly
from the you-don't-always-get-what-you-pay-for dept.

An enterprise SSD flaw will brick hardware after exactly 40,000 hours:

Hewlett Packard Enterprise (HPE) has warned that certain SSDs could fail catastrophically if buyers don't take action soon. Due to a firmware bug, the products in question will be bricked exactly 40,000 hours (four years, 206 days and 16 hours) after the SSD has entered service. "After the SSD failure occurs, neither the SSD nor the data can be recovered," the company warned in a customer service bulletin.

[...] The drives in question are 800GB and 1.6TB SAS models, along with the storage products listed in HPE's service bulletin. The flaw applies to any products with HPD7 or earlier firmware. HPE also includes instructions on how to update the firmware and how to check the total power-on time of a drive, to help plan an upgrade. According to HPE, the drives could start failing as early as October this year.
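
The 40,000-hour figure can be checked with a little arithmetic. What follows is only a minimal sketch, not HPE's tooling: the helper names and the sample power-on reading of 35,000 hours are hypothetical, and the breakdown assumes 365-day years.

```python
FAILURE_HOURS = 40_000  # threshold stated in the HPE bulletin


def breakdown(hours):
    """Split a number of hours into (years, days, hours), assuming 365-day years."""
    days, rem_hours = divmod(hours, 24)
    years, rem_days = divmod(days, 365)
    return years, rem_days, rem_hours


def hours_until_failure(power_on_hours):
    """Hours left before the threshold; 0 if it has already been reached.

    Hypothetical helper: the power-on reading would come from whatever
    tool the bulletin recommends, which is not modelled here.
    """
    return max(FAILURE_HOURS - power_on_hours, 0)


print(breakdown(FAILURE_HOURS))     # (4, 206, 16)
print(hours_until_failure(35_000))  # 5000
```

A drive reporting 35,000 power-on hours would therefore have roughly 208 days of continuous operation left before hitting the threshold.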



 
  • (Score: 2) by VacuumTube on Saturday March 28 2020, @07:42PM (5 children)

    by VacuumTube (7693) on Saturday March 28 2020, @07:42PM (#976732) Journal

    Thanks for digging out this additional information, Takyon. At this point it does appear that HPE is one of the injured parties, and that they have done what they could to mitigate the problem. The one technical detail they gave out concerning a buffer index value doesn't say much to me, but they're probably constrained by NDAs.

  • (Score: 2) by RS3 on Monday March 30 2020, @02:01AM (4 children)

    by RS3 (6367) on Monday March 30 2020, @02:01AM (#977083)

    It's conceivable, to me anyway, that a software bug could have any possible result, including writing to, erasing, or scrambling the block FLASH cell hash tables (thereby losing all data), etc.

    • (Score: 2) by VacuumTube on Tuesday March 31 2020, @08:18PM (3 children)

      by VacuumTube (7693) on Tuesday March 31 2020, @08:18PM (#977744) Journal

      Really? A single bug that would do all that plus cause the hardware to permanently fail? That's what I find intriguing about the subject, and perhaps it's just that I'm not very familiar with the hardware. But I can't recall ever seeing a software bug that caused such a complex failure in a delivered product.

      • (Score: 3, Interesting) by RS3 on Tuesday March 31 2020, @08:40PM (2 children)

        by RS3 (6367) on Tuesday March 31 2020, @08:40PM (#977756)

        Ahhh, you've never done assembly language?

        I'll pose a potential scenario: a programming error produces a bad value that causes the PC (program counter) to jump somewhere it doesn't belong, like the routine that writes to the FLASH memory. But the pointers are not correct for this write, and the write_eeprom_now routine trashes the very code that runs the SSD, which then goes even more crazy, also writing to main storage FLASH, trashing the stored data. Once code trashes itself, there's no fix unless you have external computers to do cross-checking, like the multiple redundant computers sometimes used in mission-critical stuff, like the Space Shuttle, etc., which obviously nobody does in an SSD.

        I'm not technically posing a true hardware failure here, and I didn't perceive that from TFS or TFA. However, to correct the bad control program (patch / update) on the SSD, the SSD has to be able to run well enough to receive and execute the controller's FLASH update routine. It's like a motherboard BIOS that goes bad and the MB is bricked.

        All that said, you could conceivably remove and re-flash the chip that stores the controller's programming... unless, it's internal to the controller's microprocessor, which is likely the case.

        Some microcontrollers have a pin that, when driven high or low (depending on the spec), tells the uP to load from an external ROM/FLASH chip and ignore the internal programming. Then it would be possible to re-flash the bad internal code, but obviously this would take a bit of hardware and a technician's time. And again, if the original code went berserk and trashed the main FLASH data (your stored files) then it's all moot (unless you want to repair the drive for future use...)
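
The scenario in this comment can be illustrated with a toy model. This is purely a sketch of the thought experiment, not how any real SSD controller works: the region sizes, the function names, and the idea that code and data share one address space are all assumptions made for illustration.

```python
FIRMWARE_END = 16  # bytes 0..15 hold the "controller code" (made-up size)
FLASH_SIZE = 64    # bytes 16..63 hold "user data" (made-up size)


def make_flash():
    """Build a toy flash image: 'C' bytes for code, 'D' bytes for data."""
    return bytearray(b"C" * FIRMWARE_END + b"D" * (FLASH_SIZE - FIRMWARE_END))


def write_unchecked(flash, offset, payload):
    """Write with no validation, as if reached by a stray jump with bad pointers."""
    flash[offset:offset + len(payload)] = payload


def write_checked(flash, offset, payload):
    """Same write, but refuse anything outside the user-data region."""
    if offset < FIRMWARE_END or offset + len(payload) > FLASH_SIZE:
        raise ValueError("write outside user-data region")
    flash[offset:offset + len(payload)] = payload


flash = make_flash()
write_unchecked(flash, 4, b"\x00" * 8)  # bad pointer lands inside the code
firmware_intact = flash[:FIRMWARE_END] == b"C" * FIRMWARE_END
print(firmware_intact)  # False
```

The checked variant only helps if execution actually passes through the check; a wild jump past it, as posed above, would defeat it just the same.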

        • (Score: 3, Interesting) by VacuumTube on Tuesday March 31 2020, @10:12PM (1 child)

          by VacuumTube (7693) on Tuesday March 31 2020, @10:12PM (#977811) Journal

          Actually I used to love programming in assembler, but that was long before the days of SSDs and I guess I don't think in those terms any more. So thanks for the thought experiment. It brought back fond memories.

          • (Score: 2) by RS3 on Tuesday March 31 2020, @10:36PM

            by RS3 (6367) on Tuesday March 31 2020, @10:36PM (#977821)

            You're quite welcome. Come to think of it, I don't do much assembler these days either and I'm itching to get back into it. Maybe.

            Yeah, I don't know how you could prevent these kinds of disasters without doing really good testing, code reviews, etc. Unfortunately companies are run by MBAs who see QC as being costly / overhead / loss. And that aside, egos are usually a pretty big moat to cross.

            BTW, I'm a vacuum tube hacker (too) and your username reminds me of a couple of projects that I could be working on while we wait for the world to hopefully return to normal, or whatever the new normal becomes...