
posted by janrinok on Friday March 27 2020, @09:03AM
from the you-don't-always-get-what-you-pay-for dept.

An enterprise SSD flaw will brick hardware after exactly 40,000 hours:

Hewlett Packard Enterprise (HPE) has warned that certain SSD drives could fail catastrophically if buyers don't take action soon. Due to a firmware bug, the products in question will be bricked exactly 40,000 hours (four years, 206 days and 16 hours) after the SSD has entered service. "After the SSD failure occurs, neither the SSD nor the data can be recovered," the company warned in a customer service bulletin.

[...] The drives in question are 800GB and 1.6TB SAS models and storage products listed in the service bulletin here. It applies to any products with HPD7 or earlier firmware. HPE also includes instructions on how to update the firmware and check the total time on the drive to best plan an upgrade. According to HPE, the drives could start failing as early as October this year.
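
For anyone checking the parenthetical: 40,000 hours converts as claimed if you use 365-day years and ignore leap days. A quick sketch of the arithmetic in C (nothing here is vendor code):

    #include <stdio.h>

    int main(void)
    {
        long hours = 40000;            /* the firmware's fatal power-on count */
        long days  = hours / 24;       /* 1666 whole days */
        long rem_h = hours % 24;       /* 16 hours left over */
        long years = days / 365;       /* 4 years, ignoring leap days */
        long rem_d = days % 365;       /* 206 days */
        printf("%ld h = %ld y, %ld d, %ld h\n", hours, years, rem_d, rem_h);
        return 0;
    }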


Original Submission

 
  • (Score: 1) by VacuumTube on Friday March 27 2020, @01:07PM (11 children)

    by VacuumTube (7693) on Friday March 27 2020, @01:07PM (#976278) Journal

    "I'm calling it stupid, not greedy."

    Don't you consider engineering products so that they need replacement before they wear out to be a bit greedy? Why not?

  • (Score: 2) by takyon on Friday March 27 2020, @01:16PM (10 children)

    by takyon (881) <takyonNO@SPAMsoylentnews.org> on Friday March 27 2020, @01:16PM (#976282) Journal

    Well, if you take their word for it, it was the result of a bug. So, Hanlon's razor. Also, they made the warning before any bricking actually occurred, so there won't be any need for replacement unless businesses ignore this bulletin.

    --
    [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
    • (Score: 2) by VacuumTube on Friday March 27 2020, @06:41PM

      by VacuumTube (7693) on Friday March 27 2020, @06:41PM (#976406) Journal

      "So, Hanlon's razor. "

      You're right. We don't have enough evidence for a conviction, even if it is against HPE.

    • (Score: 2) by RS3 on Saturday March 28 2020, @06:36AM

      by RS3 (6367) on Saturday March 28 2020, @06:36AM (#976575)

      I'll make the argument that bugs happen due to greed: pushing things out to customers before they're ready to be sold.

    • (Score: 2) by VacuumTube on Saturday March 28 2020, @10:40AM (7 children)

      by VacuumTube (7693) on Saturday March 28 2020, @10:40AM (#976598) Journal

      Since my previous post granting the benefit of the doubt to HPE, I've been bothered by the question of how a bug could conceivably cause an SSD to fail after a precise number of hours. It would have to be one that not only disables further operation, but goes the additional step of thoroughly wiping the data, making recovery impossible. With due respect to Hanlon, I have to say that it seems vanishingly unlikely that this could happen other than by design. So, in accordance with Occam's razor, it seems more likely to me that the only bug was in coding the failure to occur a few hours earlier than intended.

      • (Score: 4, Informative) by takyon on Saturday March 28 2020, @11:11AM (6 children)

        by takyon (881) <takyonNO@SPAMsoylentnews.org> on Saturday March 28 2020, @11:11AM (#976599) Journal

        I did some more research. SanDisk (Western Digital) is getting fingered. It seems they provided buggy firmware code that was used by both HPE and Dell:

        HPE Warns of New Bug That Kills SSD Drives After 40,000 Hours [bleepingcomputer.com]

        The company says that this is a comprehensive list of impacted SSDs it makes available. However, the issue is not unique to HPE and may be present in drives from other manufacturers.

        [...] HPE learned about the firmware bug from a SSD manufacturer and warns that if SSDs were installed and put into service at the same time they are likely to fail almost concurrently.

        [...] Last month, Dell EMC released new firmware to correct a bug causing nine SanDisk SSDs in its portfolio to fail "after approximately 40,000 hours of usage."

        [...] The update corrects a check for logging the circular buffer index value. "Assert had a bad check to validate the value of circular buffer's index value. Instead of checking the max value as N, it checked for N-1," Dell's advisory [dell.com] explains.
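
        Dell's description amounts to an off-by-one in an assertion guarding a circular buffer index. A minimal sketch of that pattern in C; the buffer size, names, and the one-slot-per-hour mapping are guesses for illustration, not the vendor's actual firmware:

            #include <assert.h>

            #define BUF_ENTRIES 40000   /* hypothetical: one log slot per power-on hour */

            /* Valid indices for a circular buffer of N entries are 0 .. N-1. */
            static unsigned advance(unsigned idx)
            {
                /* Buggy check: rejects idx == BUF_ENTRIES - 1, the last valid slot,
                   so the assert fires the first time the buffer is about to wrap.
                   In firmware, a failed assert typically halts the controller. */
                assert(idx < BUF_ENTRIES - 1);    /* should be: idx < BUF_ENTRIES */

                return (idx + 1) % BUF_ENTRIES;   /* wrap back to slot 0 */
            }

            int main(void)
            {
                unsigned idx = 0;
                for (long hour = 0; hour < 40000; hour++)
                    idx = advance(idx);           /* aborts on the 40,000th call */
                return 0;
            }

        An assert that trips on a perfectly valid index fails at an exact, deterministic count, which fits a drive dying at precisely 40,000 hours.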

        HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours [blocksandfiles.com]

        The company said in a bulletin that the “issue is not unique to HPE and potentially affects all customers that purchased these drives.” HPE has not identified the SSD maker and refused to do so, saying: “We’re not confirming manufacturers.”

        [...] It seems likely that the HPE drives are also SanDisks. Blocks & Files asked Western Digital, which acquired SanDisk in 2016, for comment. A company spokesperson said: “Per Western Digital corporate policy, we are unable to provide comments regarding other vendors’ products. As this falls within HPE’s portfolio, all related product questions would best be addressed with HPE directly.”

        --
        [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
        • (Score: 2) by VacuumTube on Saturday March 28 2020, @07:42PM (5 children)

          by VacuumTube (7693) on Saturday March 28 2020, @07:42PM (#976732) Journal

          Thanks for digging out this additional information, Takyon. At this point it does appear that HPE is one of the injured parties, and that they have done what they could to mitigate the problem. The one technical detail they gave out concerning a buffer index value doesn't say much to me, but they're probably constrained by NDAs.

          • (Score: 2) by RS3 on Monday March 30 2020, @02:01AM (4 children)

            by RS3 (6367) on Monday March 30 2020, @02:01AM (#977083)

            It's conceivable, to me anyway, that a software bug could have any possible result, including writing to, erasing, or scrambling the flash block-mapping tables (thereby losing all data), etc.

            • (Score: 2) by VacuumTube on Tuesday March 31 2020, @08:18PM (3 children)

              by VacuumTube (7693) on Tuesday March 31 2020, @08:18PM (#977744) Journal

              Really? A single bug that would do all that plus cause the hardware to permanently fail? That's what I find intriguing about the subject, and perhaps it's just that I'm not very familiar with the hardware. But I can't recall ever seeing a software bug that caused such a complex failure in a delivered product.

              • (Score: 3, Interesting) by RS3 on Tuesday March 31 2020, @08:40PM (2 children)

                by RS3 (6367) on Tuesday March 31 2020, @08:40PM (#977756)

                Ahhh, you've never done assembly language?

                I'll pose a potential scenario: a programming error (a bad value) causes the PC (program counter) to jump somewhere it doesn't belong, like the routine that writes to the FLASH memory. But the pointers are not correct for this write, and the write_eeprom_now routine trashes the very code that runs the SSD, which then goes even more crazy, also writing to the main storage FLASH and trashing the stored data. Once code trashes itself, there's no fix unless you have external computers to do cross-checking, like the multiple redundant computers sometimes used in mission-critical stuff, like the Space Shuttle, which obviously nobody does in an SSD.
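
                A minimal C sketch of that kind of wild jump (an unchecked index into a dispatch table); all names and sizes here are hypothetical, not real SSD firmware:

                    #include <stdio.h>

                    static unsigned char flash_image[64];   /* stand-in for the firmware's own code in FLASH */

                    /* Programs one byte at FLASH offset 'dst'.  Handed a garbage
                       offset, it happily overwrites the firmware itself. */
                    static void write_eeprom_now(unsigned dst, unsigned char val)
                    {
                        flash_image[dst % sizeof flash_image] = val;
                    }

                    typedef void (*handler_t)(void);

                    static void cmd_read(void)  { puts("read"); }
                    static void cmd_write(void) { write_eeprom_now(0, 0xFF); }

                    static handler_t dispatch[2] = { cmd_read, cmd_write };

                    static void handle_command(unsigned op)
                    {
                        /* Bug: no bounds check.  A corrupted 'op' indexes past the
                           table and the PC jumps through whatever garbage pointer
                           lives there -- possibly into the flash-write path with
                           junk parameters, after which the code has trashed itself
                           and can no longer even take an update. */
                        dispatch[op]();   /* should be: if (op < 2) dispatch[op](); */
                    }

                    int main(void)
                    {
                        handle_command(0);        /* normal operation */
                        /* handle_command(7); */  /* the corrupted case: undefined behavior */
                        return 0;
                    }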

                I'm not technically posing a true hardware failure here, and I didn't perceive that from TFS or TFA. However, to correct the bad control program (patch / update) on the SSD, the SSD has to be able to run well enough to receive and execute the controller's FLASH update routine. It's like a motherboard BIOS that goes bad and the MB is bricked.

                All that said, you could conceivably remove and re-flash the chip that stores the controller's programming... unless it's internal to the controller's microprocessor, which is likely the case.

                Some microcontrollers have a pin which, when driven high or low depending on the spec, tells the uP to load from an external ROM/FLASH chip and ignore the internal programming. Then it would be possible to re-flash the internal bad code, but obviously this would take a bit of hardware and a technician's time. And again, if the original code went berserk and trashed the main FLASH data (your stored files), then it's all moot (unless you want to repair the drive for future use...)

                • (Score: 3, Interesting) by VacuumTube on Tuesday March 31 2020, @10:12PM (1 child)

                  by VacuumTube (7693) on Tuesday March 31 2020, @10:12PM (#977811) Journal

                  Actually I used to love programming in assembler, but that was long before the days of SSDs and I guess I don't think in those terms any more. So thanks for the thought experiment. It brought back fond memories.

                  • (Score: 2) by RS3 on Tuesday March 31 2020, @10:36PM

                    by RS3 (6367) on Tuesday March 31 2020, @10:36PM (#977821)

                    You're quite welcome. Come to think of it, I don't do much assembler these days either and I'm itching to get back into it. Maybe.

                    Yeah, I don't know how you could prevent these kinds of disasters without doing really good testing, code reviews, etc. Unfortunately companies are run by MBAs who see QC as being costly / overhead / loss. And that aside, egos are usually a pretty big moat to cross.

                    BTW, I'm a vacuum tube hacker (too) and your username reminds me of a couple of projects that I could be working on while we wait for the world to hopefully return to normal, or whatever the new normal becomes...