posted by martyb on Wednesday January 06 2021, @03:27AM
from the bit-flip-out dept.

Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC

There's nothing quite like a fiery mailing list post by Linus Torvalds for some fun holiday-weekend reading. The Linux creator is out with one of his classic messages, this time arguing for the importance of ECC memory and giving his opinion on how Intel's "bad policies" and market segmentation have made ECC memory less widespread.

Linus argues that error-correcting code (ECC) memory "absolutely matters" but that "Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation... Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt ECC. Seriously...The arguments against ECC were always complete and utter garbage... Now even the memory manufacturers are starting [to] do ECC internally because they finally owned up to the fact that they absolutely have to. And the memory manufacturers claim it's because of economics and lower power. And they are lying bastards - let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an "attack", when it always was "we're cutting corners"."

Ian Cutress from AnandTech points out in a reply that AMD's Ryzen ECC support is not as solid as believed.

Related: Linus Torvalds: 'I'm Not a Programmer Anymore'
Linus Torvalds Rejects "Beyond Stupid" Intel Security Patch From Amazon Web Services
Linus Torvalds: Don't Hide Rust in Linux Kernel; Death to AVX-512
Linus Torvalds Doubts Linux will Get Ported to Apple M1 Hardware


  • (Score: 3, Informative) by Anonymous Coward on Wednesday January 06 2021, @08:49AM (4 children)

    by Anonymous Coward on Wednesday January 06 2021, @08:49AM (#1095546)

    The actual error rate is something like one bit every couple of decades or so. Which is why people's computers don't crash constantly. If you had 71 errors every day, your computer would crash regularly, often several times a day. Even if you aren't running Windows. The fact that this doesn't happen proves that that number is ridiculous. I have about one crash per year and I overclock. Lots of computers never crash. There are Linux systems out there with uptime above a decade.

    For real discussion on the subject (as opposed to a Wikipedia interpretation of a sensationalist journalist's misreading and hyping of a study that actually drew the opposite conclusion) see, for example, here [reddit.com] or Google's actual study here [toronto.edu]. What they found is that while there are large numbers of errors, they are concentrated in about 1-2% of the DIMMs. In other words, while a typical home user might get a bad stick of RAM and have to replace it, a bad stick of RAM in an ECC system turns into an ongoing stream of (mostly correctable) errors to the tune of thousands per day. What's more, the "bad" DIMMs are also highly correlated with machines, so there are a lot of these errors that are actually marginal motherboards or CPUs that end up getting corrected as well.

    From the actual study:

    8% of DIMMs in our fleet saw at least one correctable error per year

    And that's including the bad ones that a home user would simply replace.

    So Linus, as seems to be the norm these days, is just wrong. ECC is bad for home users because it's much slower. Like 30% slower. Datacenters use ECC because they have server CPUs with huge caches and eight memory channels that tolerate slow RAM, need all the uptime they can get, and can't afford to spend hours troubleshooting RAM problems. Home users aren't datacenters! Home users can afford to fiddle around swapping DIMMs!

    Rowhammer isn't relevant. Sure, the hardware is supposed to always correctly execute legal code, and Rowhammer code is legal code. So are side channel attacks, Meltdown and Spectre, and basically every security threat faced by modern computing. Programmers have finally gotten good at avoiding buffer overflows by using languages that aren't susceptible to them, so the security researchers are getting creative. That's good! Security got better. You still have to do it. DDR4 mitigated Rowhammer, and DDR5 is supposed to mitigate it some more. It's only really a problem for DDR3... and ECC doesn't even prevent it!

  • (Score: 2) by Immerman on Wednesday January 06 2021, @02:32PM (3 children)

    by Immerman (3985) on Wednesday January 06 2021, @02:32PM (#1095619)

    >> 8% of DIMMs in our fleet saw at least one correctable error per year
    >And that's including the bad ones that a home user would simply replace.

    Oh? And how would the typical home user know that it's memory to blame? Unless your bit-flip error rate is causing at least several bit flips per hour, you have basically zero chance of spotting the problem with a RAM test.

    • (Score: 2) by barbara hudson on Wednesday January 06 2021, @03:44PM (2 children)

      by barbara hudson (6443) <barbara.Jane.hudson@icloud.com> on Wednesday January 06 2021, @03:44PM (#1095647) Journal

      Most errors are meaningless. If you've got 8 gig of RAM but are using only 1 gig, most of the errors on DRAM refresh will not do anything since when you go to use that RAM, the first thing you do is overwrite it or zero it out. Any previous flipped bits are irrelevant.

      Most of the code you load never gets exercised, so again, bit flips there are irrelevant. Even much of the data can be bit-flipped without changing anything. After all, you don't rewrite every row in a database when you make a change to one field. And who cares about a single pixel in a 50 meg image? You're going to have more artifacts when you save it in a compressed format anyway.

      And of course, if a bit is flipped in a file's memory image after you, the human, have finished reading it, and you don't save it because you, the human, didn't change anything and the software hasn't flagged the file as modified by you, who cares?

      Most of the code you load never runs: executables have plenty of slack space in them, dead code, functions from libraries that will be run once in a blue moon, etc. Linus knows this. But I guess he's getting old and a bit irrelevant.

      --
      SoylentNews is social media. Says so right in the slogan. Soylentnews is people, not tech.
      • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @06:39PM

        by Anonymous Coward on Wednesday January 06 2021, @06:39PM (#1095706)

        Your damaged brain had its bits flipped right out! You can't even tell that you are really a man, not a deranged wannabe chemically castrated woman.

      • (Score: 2) by Immerman on Wednesday January 06 2021, @08:29PM

        by Immerman (3985) on Wednesday January 06 2021, @08:29PM (#1095773)

        Very true, which is why ECC has traditionally targeted business-critical machines, where most of the RAM is likely to be in use, and any data corruption can get very expensive.

        However, even for the rest of us there's no telling where that flipped bit will be. Going by the Google study I referenced elsewhere, with 16GB of RAM you'll average 71 flipped bits in the course of an 8-hour day. Even if only 1.4% of your RAM is holding information where a flipped bit will matter, you'll average one "important" flipped bit per day.
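The arithmetic in the comment above can be checked in a few lines. This is only a sketch: the 71 flips per day and the 1.4% fraction are the commenter's figures, while the function names and the Poisson model (flips treated as independent, which the Google study's clustering suggests is optimistic) are assumptions added here for illustration.

```python
import math


def expected_important_flips(flips_per_day, fraction_that_matters):
    """Mean number of flips per day that land in RAM where they matter."""
    return flips_per_day * fraction_that_matters


def prob_at_least_one(mean_per_day):
    """Chance of one or more such flips in a day, assuming a Poisson process."""
    return 1.0 - math.exp(-mean_per_day)


if __name__ == "__main__":
    # Figures quoted in the comment: 71 flips per 8-hour day, 1.4% of RAM sensitive.
    mean = expected_important_flips(71, 0.014)
    print(f"expected important flips per day: {mean:.3f}")
    print(f"chance of at least one per day: {prob_at_least_one(mean):.1%}")
```

Under those assumptions the mean comes out just under one per day, so "one important flipped bit per day" is an average; on a given day there could be none or several.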

        Now, how important is that really? Are you going to lose hours of work or clear out your bank account as a result? Probably not, probably it's just a nuisance. Still, RAM is a small part of the overall cost of a typical computer, and ECC RAM should only increase that cost by a small amount in order to virtually eliminate such nuisances.

        ECC also offers the advantage of letting you know immediately that your RAM is faulty, without running a time-consuming memory test that may well not detect the error. An intermittent error can be the most aggravating to identify, and buying new RAM just to see if that fixes the problem is an expensive option - assuming your computer even has replaceable RAM. Having the errors be silently repaired and (potentially) logged means you don't have the aggravation of memory errors unless the RAM is *very* faulty, and if you do have faulty RAM it will be very obvious. You can even track error rates over time to see if the problem is worsening, or if swapping stick positions resolves interference that was flipping bits in a vulnerable stick.
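As one way to do the tracking described above: on Linux, when an EDAC driver is loaded for the memory controller, corrected and uncorrected error counters are exposed under /sys/devices/system/edac/mc/. The sketch below reads them; the function name `read_edac_counts` is made up for illustration, and it assumes the per-controller `ce_count` and `ue_count` files are present. On machines without ECC or without EDAC support it simply returns nothing.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Running this from a daily cron job and appending the output to a log would be enough to see whether the corrected count is creeping up, or whether it changes after moving a stick to another slot.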

        And then there's the fact that not all bit flips are normal - RowHammer and related memory attack vectors work by overwhelming memory with atypical usage patterns designed to flip bits in adjacent memory locations the attacking software doesn't have direct access to. ECC could also go a long way to protecting against such attacks.
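For anyone wondering what "silently repaired" means mechanically, here is a toy single-error-correct, double-error-detect (SECDED) code. It is an illustration only: ECC DIMMs typically protect 64 data bits with 8 check bits, whereas this sketch uses an extended Hamming code over just 4 data bits, and the `encode`/`decode` names are invented for the example. It also shows the limit relevant to RowHammer: one flipped bit per word is fixed, two are detected but not fixed, and three or more can be mis-corrected.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    # Positions 1, 2 and 4 hold parity; positions 3, 5, 6 and 7 hold data.
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2  # overall parity bit, stored at index 0
    return c


def decode(codeword):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(c) % 2
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # An odd number of flips: assume exactly one and flip it back.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1  # simulate a single bit flip in memory
    print(decode(word))
    word[6] ^= 1  # a second flip in the same word
    print(decode(word))
```

The "uncorrectable" case is what a real system would report as an uncorrected error rather than handing back bad data.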