
SoylentNews is people

posted by martyb on Wednesday January 06 2021, @03:27AM   Printer-friendly
from the bit-flip-out dept.

Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC

There's nothing quite like some fun holiday-weekend reading as a fiery mailing list post by Linus Torvalds. The Linux creator is out with one of his classical messages, which this time is arguing over the importance of ECC memory and his opinion on how Intel's "bad policies" and market segmentation have made ECC memory less widespread.

Linus argues that error-correcting code (ECC) memory "absolutely matters" but that "Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation... Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt ECC. Seriously...The arguments against ECC were always complete and utter garbage... Now even the memory manufacturers are starting [to] do ECC internally because they finally owned up to the fact that they absolutely have to. And the memory manufacturers claim it's because of economics and lower power. And they are lying bastards - let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an "attack", when it always was "we're cutting corners"."

Ian Cutress from AnandTech points out in a reply that AMD's Ryzen ECC support is not as solid as believed.

Related: Linus Torvalds: 'I'm Not a Programmer Anymore'
Linus Torvalds Rejects "Beyond Stupid" Intel Security Patch From Amazon Web Services
Linus Torvalds: Don't Hide Rust in Linux Kernel; Death to AVX-512
Linus Torvalds Doubts Linux will Get Ported to Apple M1 Hardware



 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 4, Informative) by RamiK on Wednesday January 06 2021, @09:32AM (6 children)

    by RamiK (1813) on Wednesday January 06 2021, @09:32AM (#1095553)

    The merits of ECC in general aren't the point. It was always useful and outright essential for productivity loads. It's why workstations and servers paid a premium for it. But while it was annoying, it was justified since ECC memory involved increased costs across the design for the motherboard (the memory controller hub on the northbridge), CPU and RAM.

    But, things changed.

    Around 2011 the northbridge was assimilated into the CPU, so there are no longer additional design and validation costs for ECC on the motherboard so long as it's the default. That left the CPU and memory.

    Then a couple of years ago AMD designed their memory controllers with ECC support built in for every model, proving it doesn't really cost anything extra to get it done on the CPU / memory hub side of things either.

    Finally, and that's where the relevant part of the rant comes in, the most recent memory production nodes ended up so noisy that memory manufacturers are being forced to use ECC internally on their controllers anyhow. They even put it into their standard specs. So, what's happening now is that all the chips (pardon the pun) are in place and there's nothing BoM-wise preventing mass-market ECC adoption. That is, except for Intel's market segmentation...

    So, with AMD in the game, it's finally a fight worth fighting over for Linus. But that's only been true for the last couple of years really.

    --
    compiling...
  • (Score: 1) by shrewdsheep on Wednesday January 06 2021, @03:18PM (5 children)

    by shrewdsheep (5215) on Wednesday January 06 2021, @03:18PM (#1095637)

    If there is transparent ECC chekcing (;-) going on already, why does it have to be exposed explicitly? It would, at best, additionally check some part of the bus-transfer. I believe this would be a fine solution. The RAM controller can always expose error rates on a side channel. Could you reference some RAM modules with transparent, internal, ECC?

    • (Score: 3, Informative) by RS3 on Wednesday January 06 2021, @03:59PM

      by RS3 (6367) on Wednesday January 06 2021, @03:59PM (#1095658)

      Here are some chips: https://www.intelligentmemory.com/ECC-DRAM/DDR3/ [intelligentmemory.com]

      I'll research it some more for you and find out who uses them in DIMM modules.

      The next step will be to find out if ECC correction info / stats are available on the i2c bus.

    • (Score: 4, Interesting) by RamiK on Wednesday January 06 2021, @09:23PM (3 children)

      by RamiK (1813) on Wednesday January 06 2021, @09:23PM (#1095813)

      Could you reference some RAM modules with transparent, internal, ECC?

      To my knowledge, all DDR5 DIMMs will have on-die ECC error correction.

      why does it have to be exposed explicitly?

      It's discussed here:

      We show that if On-Die ECC is not exposed to the memory system, having a 9-chip ECC-DIMM (implementing SECDED) provides almost no reliability benefits compared to an 8-chip non-ECC DIMM. We also show that if the error detection of On-Die ECC can be exposed to the memory controller, then Chipkill-level reliability can be achieved even with a 9-chip ECC-DIMM.

      ( https://ieeexplore.ieee.org/document/7551405 [ieee.org] )

      --
      compiling...
      • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @10:03PM (2 children)

        by Anonymous Coward on Wednesday January 06 2021, @10:03PM (#1095847)

        Where is the dumbed down version? Car analogy?

        • (Score: 0) by Anonymous Coward on Friday January 08 2021, @02:01AM (1 child)

          by Anonymous Coward on Friday January 08 2021, @02:01AM (#1096796)

          Here is some background and an explanation in plain English, because I couldn't come up with a good analogy.

          ECC is usually done as SECDED (single error correct, double error detect). This means that all errors that are single bits can be corrected and all errors that are two bits can be detected but not necessarily fixed. Chipkill is a higher standard that is SECDED plus the additional guarantee that all errors introduced by a single memory chip regardless of size are correctable and result in a detectable error in the face of any other "single" errors on other chips. Basically, it makes it so that all errors on a single chip can be treated as a single bit error regardless of how big it actually is. Additionally, most errors that occur are caused by bad hardware (82-92%) and not random errors. As a result, most errors are confined to a single chip.
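
          The SECDED behaviour described above can be sketched with a toy code. This is only an illustration, assuming an extended Hamming(8,4) code over 4 data bits; real ECC DIMMs typically protect 64 data bits with 8 check bits, and the function names here are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: treat as a single-bit error and fix it.
        # A syndrome of 0 means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                        # one flipped bit: corrected
    print(decode(word))
    word[2] ^= 1                        # a second flipped bit: detected only
    print(decode(word))
```

          Three or more flipped bits can be silently miscorrected by a code like this, which is the gap Chipkill is meant to close for failures confined to one chip.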

          Now, under a regular ECC system dealing with purely random errors, whether or not the user actually sees the errors wouldn't make a difference. They pop up, are fixed, and there is nothing you can do about it anyway precisely because they are random. This is not the case with bad hardware. By not exposing detected errors to the OS and user, the system is unable to reach the same ability to correct the situation of a bad chip as the Chipkill system. Therefore, you can only get the benefit of fixing random errors, which are the minority as mentioned. However, if you do signal the OS about the errors, they are able to replace bad chips. You therefore get the majority of the benefit of Chipkill (but not 100%, because you still get errors until replacement occurs, which is not the case with Chipkill) without the performance drawbacks.
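
          As a rough illustration of what "signalling the OS" looks like in practice: on Linux, memory-controller ECC events are reported through the EDAC subsystem, which keeps per-controller counters in sysfs. The sketch below assumes the usual /sys/devices/system/edac/mc layout with ce_count (corrected) and ue_count (uncorrected) files, and an EDAC driver loaded for the platform. The function name is made up, and whether DDR5 on-die ECC events show up in these counters at all is not something this thread establishes.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    An empty dict means no EDAC memory controllers were found, for example
    because the machine has no ECC memory or no EDAC driver is loaded.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue                    # skip controllers we cannot read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

          A corrected-error count that keeps climbing on one controller is the kind of signal that points to failing hardware rather than random bit flips.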

          • (Score: 0) by Anonymous Coward on Friday January 08 2021, @02:09AM

            by Anonymous Coward on Friday January 08 2021, @02:09AM (#1096800)

            And of course I screwed it up by not proofreading. The OS can replace the bad DATA not chips while the user can replace bad chips.