
posted by janrinok on Friday March 27 2020, @09:03AM
from the you-don't-always-get-what-you-pay-for dept.

An enterprise SSD flaw will brick hardware after exactly 40,000 hours:

Hewlett Packard Enterprise (HPE) has warned that certain SSD drives could fail catastrophically if buyers don't take action soon. Due to a firmware bug, the products in question will be bricked exactly 40,000 hours (four years, 206 days and 16 hours) after the SSD has entered service. "After the SSD failure occurs, neither the SSD nor the data can be recovered," the company warned in a customer service bulletin.

[...] The drives in question are 800GB and 1.6TB SAS models and storage products listed in the service bulletin here. It applies to any products with HPD7 or earlier firmware. HPE also includes instructions on how to update the firmware and check the total time on the drive to best plan an upgrade. According to HPE, the drives could start failing as early as October this year.
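
For anyone checking the parenthetical: 40,000 hours converts as claimed if you use 365-day years and ignore leap days. A quick sketch of the arithmetic in C (nothing here is vendor code):

    #include <stdio.h>

    int main(void)
    {
        long hours = 40000;            /* the firmware's fatal power-on count */
        long days  = hours / 24;       /* 1666 whole days */
        long rem_h = hours % 24;       /* 16 hours left over */
        long years = days / 365;       /* 4 years, ignoring leap days */
        long rem_d = days % 365;       /* 206 days */
        printf("%ld h = %ld y, %ld d, %ld h\n", hours, years, rem_d, rem_h);
        return 0;
    }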


Original Submission

 
  • (Score: 1) by VacuumTube on Friday March 27 2020, @01:07PM (11 children)

    by VacuumTube (7693) on Friday March 27 2020, @01:07PM (#976278) Journal

    "I'm calling it stupid, not greedy."

    Don't you consider engineering products so that they need replacement before they wear out to be a bit greedy? Why not?

  • (Score: 2) by takyon on Friday March 27 2020, @01:16PM (10 children)

    by takyon (881) <takyonNO@SPAMsoylentnews.org> on Friday March 27 2020, @01:16PM (#976282) Journal

    Well, if you take their word for it, it was the result of a bug. So, Hanlon's razor. Also, they made the warning before any bricking actually occurred, so there won't be any need for replacement unless businesses ignore this bulletin.

    --
    [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
    • (Score: 2) by VacuumTube on Friday March 27 2020, @06:41PM

      by VacuumTube (7693) on Friday March 27 2020, @06:41PM (#976406) Journal

      "So, Hanlon's razor. "

      You're right. We don't have enough evidence for a conviction, even if it is against HPE.

    • (Score: 2) by RS3 on Saturday March 28 2020, @06:36AM

      by RS3 (6367) on Saturday March 28 2020, @06:36AM (#976575)

      I'll make the argument that bugs happen due to greed: pushing things out to customers before they're ready to be sold.

    • (Score: 2) by VacuumTube on Saturday March 28 2020, @10:40AM (7 children)

      by VacuumTube (7693) on Saturday March 28 2020, @10:40AM (#976598) Journal

      Since my previous post granting the benefit of the doubt to HPE, I've been bothered by the question of how a bug could conceivably cause an SSD to fail after a precise number of hours. It would have to be one that not only disables further operation, but goes the additional step of thoroughly wiping the data, making recovery impossible. With due respect to Hanlon, I have to say that it seems vanishingly unlikely that this could happen other than by design. So, in accordance with Occam's razor, it seems more likely to me that the only bug was in coding the failure to occur a few hours earlier than intended.

      • (Score: 4, Informative) by takyon on Saturday March 28 2020, @11:11AM (6 children)

        by takyon (881) <takyonNO@SPAMsoylentnews.org> on Saturday March 28 2020, @11:11AM (#976599) Journal

        I did some more research. SanDisk (Western Digital) is getting fingered. It seems they provided buggy firmware code that was used by both HPE and Dell:

        HPE Warns of New Bug That Kills SSD Drives After 40,000 Hours [bleepingcomputer.com]

        The company says that this is a comprehensive list of impacted SSDs it makes available. However, the issue is not unique to HPE and may be present in drives from other manufacturers.

        [...] HPE learned about the firmware bug from a SSD manufacturer and warns that if SSDs were installed and put into service at the same time they are likely to fail almost concurrently.

        [...] Last month, Dell EMC released new firmware to correct a bug causing nine SanDisk SSDs in its portfolio to fail "after approximately 40,000 hours of usage."

        [...] The update corrects a check for logging the circular buffer index value. "Assert had a bad check to validate the value of circular buffer's index value. Instead of checking the max value as N, it checked for N-1," Dell's advisory [dell.com] explains.
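
        Dell's description amounts to an off-by-one in an assertion guarding a circular buffer index. A minimal sketch of that pattern in C; the buffer size, names, and the one-slot-per-hour mapping are guesses for illustration, not the vendor's actual firmware:

            #include <assert.h>

            #define BUF_ENTRIES 40000   /* hypothetical: one log slot per power-on hour */

            /* Valid indices for a circular buffer of N entries are 0 .. N-1. */
            static unsigned advance(unsigned idx)
            {
                /* Buggy check: rejects idx == BUF_ENTRIES - 1, the last valid slot,
                   so the assert fires the first time the buffer is about to wrap.
                   In firmware, a failed assert typically halts the controller. */
                assert(idx < BUF_ENTRIES - 1);    /* should be: idx < BUF_ENTRIES */

                return (idx + 1) % BUF_ENTRIES;   /* wrap back to slot 0 */
            }

            int main(void)
            {
                unsigned idx = 0;
                for (long hour = 0; hour < 40000; hour++)
                    idx = advance(idx);           /* aborts on the 40,000th call */
                return 0;
            }

        An assert that trips on a perfectly valid index fails at an exact, deterministic count, which fits a drive dying at precisely 40,000 hours.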

        HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours [blocksandfiles.com]

        The company said in a bulletin that the “issue is not unique to HPE and potentially affects all customers that purchased these drives.” HPE has not identified the SSD maker and refused to do so, saying: “We’re not confirming manufacturers.”

        [...] It seems likely that the HPE drives are also SanDisks. Blocks & Files asked Western Digital, which acquired SanDisk in 2016, for comment. A company spokesperson said: “Per Western Digital corporate policy, we are unable to provide comments regarding other vendors’ products. As this falls within HPE’s portfolio, all related product questions would best be addressed with HPE directly.”

        --
        [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
        • (Score: 2) by VacuumTube on Saturday March 28 2020, @07:42PM (5 children)

          by VacuumTube (7693) on Saturday March 28 2020, @07:42PM (#976732) Journal

          Thanks for digging out this additional information, Takyon. At this point it does appear that HPE is one of the injured parties, and that they have done what they could to mitigate the problem. The one technical detail they gave out concerning a buffer index value doesn't say much to me, but they're probably constrained by NDAs.

          • (Score: 2) by RS3 on Monday March 30 2020, @02:01AM (4 children)

            by RS3 (6367) on Monday March 30 2020, @02:01AM (#977083)

            It's conceivable, to me anyway, that a software bug could have any possible result, including writing to, erasing, or scrambling the flash block-mapping tables (thereby losing all data), etc.

            • (Score: 2) by VacuumTube on Tuesday March 31 2020, @08:18PM (3 children)

              by VacuumTube (7693) on Tuesday March 31 2020, @08:18PM (#977744) Journal

              Really? A single bug that would do all that plus cause the hardware to permanently fail? That's what I find intriguing about the subject, and perhaps it's just that I'm not very familiar with the hardware. But I can't recall ever seeing a software bug that caused such a complex failure in a delivered product.

              • (Score: 3, Interesting) by RS3 on Tuesday March 31 2020, @08:40PM (2 children)

                by RS3 (6367) on Tuesday March 31 2020, @08:40PM (#977756)

                Ahhh, you've never done assembly language?

                I'll pose a potential scenario: a programming error (a bad value) causes the PC (program counter) to jump somewhere it doesn't belong, like the routine that writes to the FLASH memory. But the pointers are not correct for this write, and the write_eeprom_now routine trashes the very code that runs the SSD, which then goes even more crazy, also writing to the main storage FLASH and trashing the stored data. Once code trashes itself, there's no fix unless you have external computers to do cross-checking, like the multiple redundant computers sometimes used in mission-critical stuff, like the Space Shuttle, which obviously nobody does in an SSD.
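
                A minimal C sketch of that kind of wild jump (an unchecked index into a dispatch table); all names and sizes here are hypothetical, not real SSD firmware:

                    #include <stdio.h>

                    static unsigned char flash_image[64];   /* stand-in for the firmware's own code in FLASH */

                    /* Programs one byte at FLASH offset 'dst'.  Handed a garbage
                       offset, it happily overwrites the firmware itself. */
                    static void write_eeprom_now(unsigned dst, unsigned char val)
                    {
                        flash_image[dst % sizeof flash_image] = val;
                    }

                    typedef void (*handler_t)(void);

                    static void cmd_read(void)  { puts("read"); }
                    static void cmd_write(void) { write_eeprom_now(0, 0xFF); }

                    static handler_t dispatch[2] = { cmd_read, cmd_write };

                    static void handle_command(unsigned op)
                    {
                        /* Bug: no bounds check.  A corrupted 'op' indexes past the
                           table and the PC jumps through whatever garbage pointer
                           lives there -- possibly into the flash-write path with
                           junk parameters, after which the code has trashed itself
                           and can no longer even take an update. */
                        dispatch[op]();   /* should be: if (op < 2) dispatch[op](); */
                    }

                    int main(void)
                    {
                        handle_command(0);        /* normal operation */
                        /* handle_command(7); */  /* the corrupted case: undefined behavior */
                        return 0;
                    }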

                I'm not technically posing a true hardware failure here, and I didn't perceive that from TFS or TFA. However, to correct the bad control program (patch / update) on the SSD, the SSD has to be able to run well enough to receive and execute the controller's FLASH update routine. It's like a motherboard BIOS that goes bad and the MB is bricked.

                All that said, you could conceivably remove and re-flash the chip that stores the controller's programming... unless it's internal to the controller's microprocessor, which is likely the case.

                Some microcontrollers have a pin which, when driven high or low depending on the spec, tells the uP to load from an external ROM/FLASH chip and ignore the internal programming. Then it would be possible to re-flash the internal bad code, but obviously this would take a bit of hardware and a technician's time. And again, if the original code went berserk and trashed the main FLASH data (your stored files), then it's all moot (unless you want to repair the drive for future use...)

                • (Score: 3, Interesting) by VacuumTube on Tuesday March 31 2020, @10:12PM (1 child)

                  by VacuumTube (7693) on Tuesday March 31 2020, @10:12PM (#977811) Journal

                  Actually I used to love programming in assembler, but that was long before the days of SSDs and I guess I don't think in those terms any more. So thanks for the thought experiment. It brought back fond memories.

                  • (Score: 2) by RS3 on Tuesday March 31 2020, @10:36PM

                    by RS3 (6367) on Tuesday March 31 2020, @10:36PM (#977821)

                    You're quite welcome. Come to think of it, I don't do much assembler these days either and I'm itching to get back into it. Maybe.

                    Yeah, I don't know how you could prevent these kinds of disasters without doing really good testing, code reviews, etc. Unfortunately companies are run by MBAs who see QC as being costly / overhead / loss. And that aside, egos are usually a pretty big moat to cross.

                    BTW, I'm a vacuum tube hacker (too) and your username reminds me of a couple of projects that I could be working on while we wait for the world to hopefully return to normal, or whatever the new normal becomes...