
SoylentNews is people

posted by martyb on Wednesday January 06 2021, @03:27AM
from the bit-flip-out dept.

Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC

There's nothing quite like a fiery mailing list post by Linus Torvalds for some fun holiday-weekend reading. The Linux creator is out with one of his classic messages, this time arguing for the importance of ECC memory and giving his opinion on how Intel's "bad policies" and market segmentation have made ECC memory less widespread.

Linus argues that error-correcting code (ECC) memory "absolutely matters" but that "Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation... Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt ECC. Seriously...The arguments against ECC were always complete and utter garbage... Now even the memory manufacturers are starting [to] do ECC internally because they finally owned up to the fact that they absolutely have to. And the memory manufacturers claim it's because of economics and lower power. And they are lying bastards - let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an "attack", when it always was "we're cutting corners"."

Ian Cutress from AnandTech points out in a reply that AMD's Ryzen ECC support is not as solid as believed.

Related: Linus Torvalds: 'I'm Not a Programmer Anymore'
Linus Torvalds Rejects "Beyond Stupid" Intel Security Patch From Amazon Web Services
Linus Torvalds: Don't Hide Rust in Linux Kernel; Death to AVX-512
Linus Torvalds Doubts Linux will Get Ported to Apple M1 Hardware



 
This discussion has been archived. No new comments can be posted.
  • (Score: 5, Informative) by sjames on Wednesday January 06 2021, @07:45AM (10 children)

    by sjames (2882) on Wednesday January 06 2021, @07:45AM (#1095534) Journal

    You mean you had no problems that you are aware of. I maintain a number of machines with ECC used for simulations. They run just fine, but once in a blue moon, one of them will log a corrected memory error. You could run memtest daily and never happen to catch an error. Memtest is designed to catch failing hardware, not the occasional random bit flip. You'd have to continuously run memtest for months to actually catch that sort of error.

  • (Score: 2, Interesting) by Anonymous Coward on Wednesday January 06 2021, @10:53AM

    by Anonymous Coward on Wednesday January 06 2021, @10:53AM (#1095565)

    IIRC Google testing showed that DRAM can expect 1 bit flip per gigabyte per month due to background radiation, regardless of brand or type.
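To put that figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the rate quoted above (1 bit flip per gigabyte per month, which the poster gives from memory and which is not verified here), a 30-day month, and flips spread evenly over time. The function name and constant are made up for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def expected_bit_flips(ram_gb, hours, flips_per_gb_month=1.0):
    """Expected number of bit flips for ram_gb gigabytes over the given hours.

    flips_per_gb_month defaults to the rate recalled in the comment above;
    treat it as a placeholder, not a measured value.
    """
    return ram_gb * (hours / HOURS_PER_MONTH) * flips_per_gb_month


if __name__ == "__main__":
    # A 64 GB machine over one month.
    print(expected_bit_flips(64, HOURS_PER_MONTH))
    # A 16 GB machine during a single one-hour memory test.
    print(expected_bit_flips(16, 1))
```

Under those assumptions, a short test window sees only a small fraction of one expected flip, which fits the earlier point that an occasional memtest run is unlikely to catch a random bit flip.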

  • (Score: 2) by RS3 on Wednesday January 06 2021, @03:22PM (8 children)

    by RS3 (6367) on Wednesday January 06 2021, @03:22PM (#1095638)

    Yes, I know about hardware, memtest, bit-flips, etc., but thanks for the general info.

    "They run just fine, but once in a blue moon, one of them will log a corrected memory error."

    Which OS, and what software is logging, and where are the logs?

    I maintain both Linux and Windows servers and I've never seen a logged RAM error, so I'm wondering if I'm missing something, like a log somewhere that I don't know about...

    • (Score: 3, Informative) by sjames on Wednesday January 06 2021, @07:00PM (3 children)

      by sjames (2882) on Wednesday January 06 2021, @07:00PM (#1095717) Journal

      CentOS. It shows up in /var/log/messages and on the console. Debian puts it in syslog. It can also be found in the system event log. I have seen it on Cisco and on Supermicro hardware. I know that not all MBs support reporting ECC issues to the kernel.
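For anyone who wants to search those logs, a minimal sketch of a filter follows. The patterns are an assumption: kernel memory-controller messages are normally prefixed with "EDAC", and machine-check reports usually carry "[Hardware Error]" or "Machine check". The exact wording varies by kernel and driver, and EDAC also prints harmless driver-load messages, so any match still needs a human look. The function names are hypothetical.

```python
import re
from typing import Iterable, List

# Assumed markers for memory-error reports in kernel logs; adjust for
# your kernel and hardware.
ECC_PATTERN = re.compile(r"EDAC|\[Hardware Error\]|Machine check", re.IGNORECASE)


def find_memory_error_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that look like EDAC or machine-check reports."""
    return [line.rstrip("\n") for line in lines if ECC_PATTERN.search(line)]


def scan_log(path: str) -> List[str]:
    """Scan one log file, e.g. /var/log/messages or /var/log/syslog."""
    with open(path, errors="replace") as handle:
        return find_memory_error_lines(handle)


if __name__ == "__main__":
    for hit in scan_log("/var/log/messages"):
        print(hit)
```

Reading /var/log/messages usually requires root, and rotated logs (compressed or numbered files) would need to be scanned separately.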

      • (Score: 2) by RS3 on Wednesday January 06 2021, @07:53PM (1 child)

        by RS3 (6367) on Wednesday January 06 2021, @07:53PM (#1095750)

        Thank you for that. [shuffles off to check logs...]

        Some CentOS here too. I check /var/log/messages, "dmesg", daily "logwatch" email, lots of goodies in /sys and /proc, but I've never seen a RAM error message. Obviously that doesn't mean it never happens, just that I've never seen it. Haven't been running the Open Manage software, but it might help. Had too many problems getting it to work well. Well, I do run arcconf but not the rest of it.

        • (Score: -1, Troll) by Anonymous Coward on Wednesday January 06 2021, @08:12PM

          by Anonymous Coward on Wednesday January 06 2021, @08:12PM (#1095764)

          and you're not running with ECC ram. he said it logged a correction. are you daft?

      • (Score: 2) by RS3 on Wednesday January 06 2021, @08:08PM

        by RS3 (6367) on Wednesday January 06 2021, @08:08PM (#1095761)

        Sorry- hit "submit" too soon (as I do too often...)

        Yeah, it may be a matter of MB driver support. Dell has fairly good Open Manage modules for some of their hardware, but I don't know the extent. I run it on the Windows servers, but was having SW problems with one so I disabled it... [shuffles off again...]

        Well, I ran the Dell Open Manage, checked the logs, and there was 1 ECC bit correction in September, and no other ECC or RAM messages in a year. I can live with that. :)

        I might look into running the Open Manage on the Linux servers. I don't run them in X mode, but have Windows workstations I remote into that I can display the output on (Cygwin/X, etc.)

    • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @08:30PM (3 children)

      by Anonymous Coward on Wednesday January 06 2021, @08:30PM (#1095774)

      You have to enable Machine Check Exceptions (MCE) in your BIOS. Then you should see it in the system logs (Linux). On Debian 10 you need to install "collectd-core" IIRC; I don't know about other distributions. On Windows I presume the MCEs will show up in the logs as well, but I don't have any (important) Windows boxes to check.
      One nice link I have in my bookmarks is http://mindofjim.blogspot.com/2010/01/i-see-see-ecc.html

      • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @09:27PM

        by Anonymous Coward on Wednesday January 06 2021, @09:27PM (#1095816)

        That package depends on the actual daemon for reporting it. Instead, you can install the daemon directly: mcelog for older kernels and rasdaemon for newer kernels.

      • (Score: 2) by RS3 on Thursday January 07 2021, @05:43AM (1 child)

        by RS3 (6367) on Thursday January 07 2021, @05:43AM (#1096286)

        Thank you for all the good info. Actually there's no such setting in the BIOS (I checked), and I don't think I've ever seen that in a BIOS, but I'll keep an eye out for it. I'm one to get into BIOS on any machine I touch, and at least check settings, etc. Is MCE on by default if there's no MCE setting in the BIOS?

        • (Score: 0) by Anonymous Coward on Thursday January 07 2021, @10:37AM

          by Anonymous Coward on Thursday January 07 2021, @10:37AM (#1096395)

          Most have a toggle but some are always on and some are always off. The only way to know for sure is to check your MCE or RAS status in sysfs after installing the correct daemon for your kernel.
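One way to do that sysfs check is sketched below. It assumes the kernel's EDAC driver is loaded and exposes per-memory-controller counters under /sys/devices/system/edac/mc (ce_count for corrected errors, ue_count for uncorrected ones). Whether that directory is populated depends on the platform and driver, and the function name is hypothetical.

```python
from pathlib import Path
from typing import Dict


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> Dict[str, Dict[str, int]]:
    """Collect corrected/uncorrected error counts per memory controller.

    Returns an empty dict when the EDAC directory is missing, which
    usually means no EDAC driver is loaded for this hardware.
    """
    counts: Dict[str, Dict[str, int]] = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for controller in sorted(base.glob("mc[0-9]*")):
        entry: Dict[str, int] = {}
        for name in ("ce_count", "ue_count"):
            counter = controller / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        if entry:
            counts[controller.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found")
    for controller, entry in result.items():
        print(controller, entry)
```

An empty result does not prove ECC is off; it only shows that nothing is being reported through this interface.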