SoylentNews Comments | Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC

posted by martyb on Wednesday January 06 2021, @03:27AM

from the bit-flip-out dept.

There's nothing quite like a fiery mailing list post by Linus Torvalds for some fun holiday-weekend reading. The Linux creator is out with one of his classic messages, this time arguing for the importance of ECC memory and giving his opinion on how Intel's "bad policies" and market segmentation have made ECC memory less widespread.
Linus argues that error-correcting code (ECC) memory "absolutely matters" but that "Intel has been instrumental in killing the whole ECC industry with its horribly bad market segmentation... Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt ECC. Seriously... The arguments against ECC were always complete and utter garbage... Now even the memory manufacturers are starting [to] do ECC internally because they finally owned up to the fact that they absolutely have to. And the memory manufacturers claim it's because of economics and lower power. And they are lying bastards - let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an "attack", when it always was "we're cutting corners"."

Ian Cutress from AnandTech points out in a reply that AMD's Ryzen ECC support is not as solid as believed.

Re:He's absolutely right (Score: 5, Informative)

by sjames (2882) on Wednesday January 06 2021, @07:45AM (#1095534)

You mean you had no problems that you are aware of. I maintain a number of machines with ECC used for simulations. They run just fine, but once in a blue moon, one of them will log a corrected memory error. You could run memtest daily and never happen to catch an error. Memtest is designed to catch failing hardware, not the occasional random bit flip. You'd have to continuously run memtest for months to actually catch that sort of error.

Re:He's absolutely right (Score: 2, Interesting)

by Anonymous Coward on Wednesday January 06 2021, @10:53AM (#1095565)

IIRC Google testing showed that DRAM can expect 1 bit flip per gigabyte per month due to background radiation, regardless of brand or type.
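
To put that figure in perspective, here is a rough back-of-the-envelope sketch. It takes the "1 bit flip per gigabyte per month" rate at face value (it is quoted from memory above and not verified here) and assumes flips arrive independently, so a Poisson model applies. The function names are made up for illustration.

```python
import math

# Rate as recalled in the comment above; an unverified assumption.
FLIPS_PER_GB_MONTH = 1.0


def expected_flips(gigabytes: float, months: float,
                   rate: float = FLIPS_PER_GB_MONTH) -> float:
    """Expected number of bit flips for a given memory size and time span."""
    return rate * gigabytes * months


def prob_at_least_one_flip(gigabytes: float, months: float,
                           rate: float = FLIPS_PER_GB_MONTH) -> float:
    """Probability of one or more flips, assuming independent (Poisson) arrivals."""
    return 1.0 - math.exp(-expected_flips(gigabytes, months, rate))


if __name__ == "__main__":
    # A 16 GB machine over one day (about 1/30 of a month).
    print(expected_flips(16, 1 / 30))
    print(prob_at_least_one_flip(16, 1 / 30))
```

Under these assumptions a single overnight memtest pass covers only a small slice of the time in which a random flip might occur, which fits the point made earlier that memtest is aimed at failing hardware rather than rare one-off flips.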

Re:He's absolutely right (Score: 2)

by RS3 (6367) on Wednesday January 06 2021, @03:22PM (#1095638)

Yes, I know about hardware, memtest, bit-flips, etc., but thanks for the general info.
"They run just fine, but once in a blue moon, one of them will log a corrected memory error."
Which OS, what software is doing the logging, and where are the logs?
I maintain both Linux and Windows servers and I've never seen a logged RAM error, so I'm wondering if I'm missing something, such as a log somewhere that I don't know about...

- Re:He's absolutely right (Score: 3, Informative)
  
  by sjames (2882) on Wednesday January 06 2021, @07:00PM (#1095717)
  
  CentOS. It shows up in /var/log/messages and on the console. Debian puts it in syslog. It can also be found in the system event log. I have seen it on Cisco and on Supermicro hardware. I know that not all MBs support reporting ECC issues to the kernel.
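
For anyone wanting to check the log files named above, a minimal sketch follows. The search substrings ("EDAC", "Hardware Error", "Machine check") are assumptions: the exact wording of memory error messages varies with kernel version, driver and reporting daemon, so adjust the patterns to whatever your system actually prints. The function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Iterable, List

# Assumed substrings; real message text differs between kernels, drivers and daemons.
DEFAULT_PATTERNS = (r"\bEDAC\b", r"Hardware Error", r"Machine check")


def find_memory_error_lines(lines: Iterable[str],
                            patterns: Iterable[str] = DEFAULT_PATTERNS) -> List[str]:
    """Return the lines that match any of the given regular expressions."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [line.rstrip("\n") for line in lines
            if any(rx.search(line) for rx in compiled)]


def scan_log_files(paths: Iterable[str]) -> List[str]:
    """Scan the log files that exist and are readable; skip the rest."""
    hits: List[str] = []
    for name in paths:
        path = Path(name)
        if not path.is_file():
            continue
        try:
            with path.open(errors="replace") as handle:
                hits.extend(find_memory_error_lines(handle))
        except PermissionError:
            continue  # reading system logs usually needs root
    return hits


if __name__ == "__main__":
    # Log locations named in the comment above (CentOS and Debian).
    for hit in scan_log_files(["/var/log/messages", "/var/log/syslog"]):
        print(hit)
```

An empty result only means nothing matched these patterns in these files. It does not show that no error occurred, especially on boards that do not report ECC events to the kernel.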
  
  - Re:He's absolutely right (Score: 2)
    
    by RS3 (6367) on Wednesday January 06 2021, @07:53PM (#1095750)
    
    Thank you for that. [shuffles off to check logs...]
    Some CentOS here too. I check /var/log/messages, "dmesg", the daily "logwatch" email, and lots of goodies in /sys and /proc, but I've never seen a RAM error message. Obviously that doesn't mean it never happens, just that I've never seen it. I haven't been running the OpenManage software, but it might help; I had too many problems getting it to work well. Well, I do run arcconf but not the rest of it.
    
    - Re:He's absolutely right (Score: -1, Troll)
      
      by Anonymous Coward on Wednesday January 06 2021, @08:12PM (#1095764)
      
      And you're not running with ECC RAM. He said it logged a correction. Are you daft?
      
  - Re:He's absolutely right (Score: 2)
    
    by RS3 (6367) on Wednesday January 06 2021, @08:08PM (#1095761)
    
    Sorry, hit "submit" too soon (as I do too often...)
    Yeah, it may be a matter of MB driver support. Dell has fairly good OpenManage modules for some of their hardware, but I don't know the extent. I run it on the Windows servers, but was having SW problems with one so I disabled it... [shuffles off again...]
    Well, I ran Dell OpenManage, checked the logs, and there was 1 ECC bit correction in September, and no other ECC or RAM messages in a year. I can live with that. :)
    I might look into running OpenManage on the Linux servers. I don't run them in X mode, but have Windows workstations I remote into that I can display the output on (Cygwin/X, etc.)
    
- Re:He's absolutely right (Score: 0)
  
  by Anonymous Coward on Wednesday January 06 2021, @08:30PM (#1095774)
  
  You have to enable Machine Check Exceptions (MCE) in your BIOS. Then you should see them in the system logs (Linux). On Debian 10 you need to install "collectd-core" iirc; I don't know about other distributions. On Windows I presume the MCEs will show up in the logs as well, but I don't have any (important) Windows boxes to check.
  One nice link I have in my bookmarks is http://mindofjim.blogspot.com/2010/01/i-see-see-ecc.html
  
  - Re:He's absolutely right (Score: 0)
    
    by Anonymous Coward on Wednesday January 06 2021, @09:27PM (#1095816)
    
    That package depends on the actual daemon for reporting it. Instead, you can use the daemon directly: mcelog for older kernels and rasdaemon for newer kernels.
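
A quick way to see which of the two daemons is present on a box is sketched below. It only looks for executables named "rasdaemon" and "mcelog" on PATH; those executable names are assumed from the package names mentioned above, and the helper name is made up. It does not check whether the daemon is actually running.

```python
import shutil
from typing import Callable, List, Optional

# Executable names assumed from the two packages mentioned above.
CANDIDATE_TOOLS = ("rasdaemon", "mcelog")


def installed_error_daemons(
        which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the candidate tools that are found on PATH."""
    return [tool for tool in CANDIDATE_TOOLS if which(tool) is not None]


if __name__ == "__main__":
    found = installed_error_daemons()
    print("found:", ", ".join(found) if found else "neither rasdaemon nor mcelog")
```

System daemons often live in an sbin directory that is not on an unprivileged user's PATH, so an empty result as a normal user is not conclusive.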
    
  - Re:He's absolutely right (Score: 2)
    
    by RS3 (6367) on Thursday January 07 2021, @05:43AM (#1096286)
    
    Thank you for all the good info. Actually there's no such setting in the BIOS (I checked), and I don't think I've ever seen that in a BIOS, but I'll keep an eye out for it. I'm one to get into BIOS on any machine I touch, and at least check settings, etc. Is MCE on by default if there's no MCE setting in the BIOS?
    
    - Re:He's absolutely right (Score: 0)
      
      by Anonymous Coward on Thursday January 07 2021, @10:37AM (#1096395)
      
      Most have a toggle, but some are always on and some are always off. The only way to know for sure is to check your MCE or RAS status in sysfs after installing the correct daemon for your kernel.
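
As one way of looking at sysfs, the sketch below reads the per-controller corrected and uncorrected error counters that the Linux EDAC subsystem usually exposes under /sys/devices/system/edac/mc. Treat that path and the ce_count/ue_count file names as assumptions to verify on your own kernel; the function name is hypothetical, and this reads the kernel counters directly rather than going through mcelog or rasdaemon.

```python
from pathlib import Path
from typing import Dict

# Usual location of the kernel's EDAC memory-controller counters; treat as an assumption.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root: str = EDAC_ROOT) -> Dict[str, Dict[str, int]]:
    """Map each memory controller (mc0, mc1, ...) to its corrected/uncorrected counts."""
    counts: Dict[str, Dict[str, int]] = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for controller in sorted(base.glob("mc[0-9]*")):
        entry: Dict[str, int] = {}
        for name in ("ce_count", "ue_count"):
            counter = controller / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[controller.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("no EDAC memory controllers found")
    for controller, entry in result.items():
        print(controller, entry)
```

An empty result means no EDAC memory controller was registered with the kernel. That can happen on non-ECC systems, but also when the matching driver is not loaded, so it is a hint rather than proof.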
      
