SoylentNews is people

posted by martyb on Wednesday January 06 2021, @03:27AM
from the bit-flip-out dept.

Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC

There's nothing quite like some fun holiday-weekend reading as a fiery mailing list post by Linus Torvalds. The Linux creator is out with one of his classical messages, which this time is arguing over the importance of ECC memory and his opinion on how Intel's "bad policies" and market segmentation have made ECC memory less widespread.

Linus argues that error-correcting code (ECC) memory "absolutely matters" but that "Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation... Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt ECC. Seriously...The arguments against ECC were always complete and utter garbage... Now even the memory manufacturers are starting [to] do ECC internally because they finally owned up to the fact that they absolutely have to. And the memory manufacturers claim it's because of economics and lower power. And they are lying bastards - let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an "attack", when it always was "we're cutting corners"."

Ian Cutress from AnandTech points out in a reply that AMD's Ryzen ECC support is not as solid as believed.

Related: Linus Torvalds: 'I'm Not a Programmer Anymore'
Linus Torvalds Rejects "Beyond Stupid" Intel Security Patch From Amazon Web Services
Linus Torvalds: Don't Hide Rust in Linux Kernel; Death to AVX-512
Linus Torvalds Doubts Linux will Get Ported to Apple M1 Hardware



 
  • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @10:03PM (#1095847)

    Where is the dumbed down version? Car analogy?

  • (Score: 0) by Anonymous Coward on Friday January 08 2021, @02:01AM (#1096796)

    Here is some background and a plain-English explanation, since I couldn't come up with a good analogy.

    ECC is usually done as SECDED (single error correct, double error detect). This means that every single-bit error can be corrected, and every two-bit error can be detected but not necessarily fixed. Chipkill is a stricter standard: SECDED plus the additional guarantee that any error introduced by a single memory chip, regardless of how many bits it affects, is correctable, and that it still results in a detectable error if another "single" error occurs on a different chip. Basically, every error confined to one chip can be treated as a single-bit error no matter how big it actually is. Additionally, most errors that occur are caused by bad hardware (82-92%) and not by random events. As a result, most errors are confined to a single chip.
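    To make the SECDED idea concrete, here is a minimal sketch using an extended Hamming(8,4) code: 4 data bits, 3 Hamming parity bits, and 1 overall parity bit. This is a toy, not what a real memory controller does (those typically protect 64 data bits with 8 check bits), and the names `encode` and `decode` are made up for illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = sum(code) % 2             # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as a single-bit error. The syndrome is the position
        # of the flipped bit (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        # The error is detected, but its location is unknown.
        return None, "uncorrectable"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                            # one flipped bit: fixed silently
print(decode(word))
word[2] ^= 1                            # a second flipped bit: detected only
print(decode(word))
```

    The first call recovers the original data and reports a correction; the second can only report that something is wrong, which is the "detect but not necessarily fix" case described above.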

    Now, under a regular ECC system dealing with purely random errors, whether or not the user actually sees the errors wouldn't make a difference. They pop up, are fixed, and there is nothing you can do about them anyway, precisely because they are random. This is not the case with bad hardware. If detected errors are not exposed to the OS and the user, the system cannot deal with a bad chip the way a Chipkill system can. You therefore only get the benefit of fixing random errors, which are the minority as mentioned. However, if you do signal the OS about the errors, they are able to replace bad chips. You therefore get the majority of the benefit of Chipkill (but not 100%, because you still get errors until replacement occurs, which is not the case with Chipkill) without the performance drawbacks.
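    As a rough sketch of why reporting matters, the snippet below counts corrected errors per chip and flags chips whose count looks more like failing hardware than random noise. Everything here is hypothetical: the class name `CorrectedErrorLog`, the chip labels, and the threshold of 3 are invented for illustration and do not correspond to any real OS interface.

```python
from collections import Counter


class CorrectedErrorLog:
    """Tally corrected-error reports per memory chip (hypothetical model)."""

    def __init__(self, threshold=3):
        # Assumed cutoff: this many corrected errors on one chip is treated
        # as a sign of bad hardware rather than a random event.
        self.threshold = threshold
        self.counts = Counter()

    def report(self, chip_id):
        """Record one corrected error attributed to chip_id."""
        self.counts[chip_id] += 1

    def suspect_chips(self):
        """Return the chips at or above the threshold, sorted by name."""
        return sorted(c for c, n in self.counts.items() if n >= self.threshold)


log = CorrectedErrorLog(threshold=3)
for chip in ["chip0", "chip2", "chip2", "chip5", "chip2", "chip2"]:
    log.report(chip)

# Isolated errors on chip0 and chip5 look random; chip2 keeps failing.
print(log.suspect_chips())
```

    If the hardware corrects errors silently, nothing like this tally can exist, and the repeat offender stays in the machine until it produces an error that ECC cannot fix.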

    • (Score: 0) by Anonymous Coward on Friday January 08 2021, @02:09AM (#1096800)

      And of course I screwed it up by not proofreading. The OS can replace the bad DATA not chips while the user can replace bad chips.