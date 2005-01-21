from the bit-flip-out dept.
Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC
There's nothing quite like a fiery mailing list post by Linus Torvalds for some fun holiday-weekend reading. The Linux creator is out with one of his classic messages, this time arguing for the importance of ECC memory and giving his opinion on how Intel's "bad policies" and market segmentation have made ECC memory less widespread.
Linus argues that error-correcting code (ECC) memory "absolutely matters" but that "Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation... Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt ECC. Seriously...The arguments against ECC were always complete and utter garbage... Now even the memory manufacturers are starting [to] do ECC internally because they finally owned up to the fact that they absolutely have to. And the memory manufacturers claim it's because of economics and lower power. And they are lying bastards - let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an "attack", when it always was "we're cutting corners"."
Ian Cutress from AnandTech points out in a reply that AMD's Ryzen ECC support is not as solid as believed.
(Score: 5, Insightful) by Arik on Wednesday January 06, @03:34AM (18 children)
Or you can save a few fractions of a cent per unit, make garbage, and blow smoke up the arse of the potential purchasing public. And then stiff the minority of purchasers that don't want garbage with a ridiculous premium upcharge for something that should be standard. If you even think their orders are worth your time, at said ridiculous premium, which you probably don't.
Guess which choice absolutely every manufacturer went to in short order?
In a healthy market this sort of scam has a very short lifespan. They've been doing this to us for half a century now. This is not a healthy market. Change my mind.
The *other* sort of Marxist.
(Score: 5, Insightful) by sjames on Wednesday January 06, @03:53AM
That's the funny thing with markets in our economy. Most of them are unhealthy. The evidence is all around us highlighted in flashing neon.
(Score: 4, Informative) by fustakrakich on Wednesday January 06, @04:25AM (2 children)
Story of our lives. They cut corners everywhere. It's always a coldly calculated risk [wfu.edu]. Our *pillars of society* are just as crooked as the average heroine dealer that cuts his product with Drano
Ok, we paid the ransom. Do I get my dog back? REDЯUM
(Score: 1) by fustakrakich on Wednesday January 06, @04:27AM
A bit Freudian, eh? Too bad spell check doesn't do context...
Ok, we paid the ransom. Do I get my dog back? REDЯUM
(Score: 1, Funny) by Anonymous Coward on Wednesday January 06, @04:28AM
Uh, interesting analogy. Spoken from direct experience? :)
(Score: 2) by RS3 on Wednesday January 06, @04:26AM (6 children)
Somewhere else someone posted some gaming benchmarks showing ECC was noticeably slower. So gamers, don't use ECC.
I know several people who use Xeon-based "workstations", which have ECC RAM, as their main computers. So there's that.
And AMD support ECC more than Intel.
Someone pointed out that few laptops have ECC support.
Years ago RAM wasn't so reliable. Slowly it's gotten better, and parity RAM pretty much phased out.
I've had almost no RAM problems in more than 20 years. A couple of crap brand sticks that were bad in machines I was given (or trash-picked) but I don't think I've ever had something crash or any kind of indication of a flipped bit in any other machines. I run MemTest86 from time to time just to check, and no problems.
But if you're that worried, go with Xeon + ECC.
(Score: 2) by Arik on Wednesday January 06, @04:42AM (3 children)
No, stupid gamers don't use ECC.
Honestly, if this is your level of understanding, there's no point in trying to have a conversation. Ridicule alone is appropriate.
"Years ago RAM wasn't so reliable. Slowly it's gotten better, and parity RAM pretty much phased out."
The minor theoretical improvements in RAM reliability are more than offset by increased RAM density. You've got it bass-ackwards, in other words.
"I've had almost no RAM problems in more than 20 years."
That you correctly diagnosed.
"I run MemTest86 from time to time just to check, and no problems."
Oh? Obviously I was wrong, you're a genius, that's the gold standard right there. If memtest86 from time to time didn't diagnose a memory issue, then clearly you never had one - and if you never had one, then no one did. All in my mind.
"But if you're that worried, go with Xeon + ECC."
Yes, we're talking about the premium and less than certain availability of that choice.
The *other* sort of Marxist.
(Score: 4, Touché) by RS3 on Wednesday January 06, @05:53AM (2 children)
Dude, what's your problem? I used to consider you a friend.
Why does everyone take a post as an absolute statement? In my real life, conversations evolve, interactively. Sorry if I didn't read your mind nor measure up to your standards of what the eff I'm supposed to post here.
Not sure what's wrong with you Arik but I truly hope you find some peace and happiness somewhere. Insults and attack me about effing RAM? I'm truly sorry I tried to contribute. I wish I could delete my posts.
(Score: 1, Funny) by Anonymous Coward on Wednesday January 06, @06:17AM (1 child)
I'll be your friend! Tell me what you want me to say.
(Score: 2) by RS3 on Wednesday January 06, @06:38AM
That Arik forgot to take his meds, but will take them and be better tomorrow.
Not sure if you meant to be funny, but thank you, sincerely, for a good laugh!
(Score: 2) by sjames on Wednesday January 06, @07:45AM (1 child)
You mean you had no problems that you are aware of. I maintain a number of machines with ECC used for simulations. They run just fine, but once in a blue moon, one of them will log a corrected memory error. You could run memtest daily and never happen to catch an error. Memtest is designed to catch failing hardware, not the occasional random bit flip. You'd have to continuously run memtest for months to actually catch that sort of error.
(Score: 0) by Anonymous Coward on Wednesday January 06, @10:53AM
IIRC Google testing showed that DRAM can expect 1 bit flip per gigabyte per month due to background radiation, regardless of brand or type.
(Score: 4, Informative) by Immerman on Wednesday January 06, @06:03AM (3 children)
Hear, hear. As the quantity of RAM increases, the overall error rate goes up too - in 2009 Google published a study based on their servers that determined the error rate was 1 bit error per gigabyte of RAM per 1.8 hours https://en.wikipedia.org/wiki/ECC_memory#Research [wikipedia.org]
Assuming reliability is still about the same, that means a typical 16GB computer can expect 71 bit errors over the course of a typical 8-hour day.
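A quick sanity check of that arithmetic, as a sketch only: it assumes the rate quoted above (1 bit error per gigabyte per 1.8 hours) and that errors scale linearly with RAM size and time. The function names are made up for illustration.

```python
# Assumed rate, taken from the comment above: 1 bit error per GB per 1.8 hours.
HOURS_PER_ERROR_PER_GB = 1.8


def expected_bit_errors(ram_gb: float, hours: float) -> float:
    """Expected number of bit errors, assuming linear scaling in size and time."""
    return ram_gb * hours / HOURS_PER_ERROR_PER_GB


def minutes_between_errors(ram_gb: float) -> float:
    """Average gap between bit errors, in minutes, for a given amount of RAM."""
    return 60.0 * HOURS_PER_ERROR_PER_GB / ram_gb


if __name__ == "__main__":
    print(round(expected_bit_errors(16, 8)))      # 71
    print(round(minutes_between_errors(16), 2))   # 6.75
```

Under those assumptions a 16GB machine sees about 71 errors in an 8-hour day, or one roughly every seven minutes. Whether the assumed rate is realistic is a separate question.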
(Score: 0) by Anonymous Coward on Wednesday January 06, @06:18AM (1 child)
And what the fuck does that mean for end-users?
I can just hear the Best Buy rep telling me about 71 bit errors per 1.8 hours.
(Score: 2) by Immerman on Wednesday January 06, @06:49AM
That depends entirely on what you're doing with that RAM.
If you're playing video games - probably nothing much - slight change in the color of one pixel on a texture somewhere, or a bit of a health change, or something warps through geometry as their position changes. Nothing much compared to all the bugs.
If you've got a huge database or spreadsheet open - congratulations, roughly every seven minutes, on average, another piece of data or formatting gets silently corrupted.
And if the error is in the RAM containing the machine code of your program itself.... well then who knows? Almost anything could happen - the software is corrupted, and will no longer work as intended... maybe the corruption is in an infrequently used function that never gets used before you close it down - then nothing happens. Or maybe it's in a core loop of your program, or even operating system, in which case maybe it crashes, or maybe corrupts whatever data it touches - it's kind of like invoking undefined behavior in a programming language - maybe nothing happens, maybe the computer calls Halt And Catch Fire, or anything in between - you just don't know until it happens.
(Score: 0) by Anonymous Coward on Wednesday January 06, @08:49AM
The actual error rate is something like one bit every couple of decades or so. Which is why people's computers don't crash constantly. If you had 71 errors every day, your computer would crash regularly, often several times a day. Even if you aren't running Windows. The fact that this doesn't happen proves that that number is ridiculous. I have about one crash per year and I overclock. Lots of computers never crash. There are Linux systems out there with uptime above a decade.
For real discussion on the subject (as opposed to a Wikipedia interpretation of a sensationalist journalist's misreading and hyping of a study that actually drew the opposite conclusion) see, for example, here [reddit.com] or Google's actual study here [toronto.edu]. What they found is that while there are large numbers of errors, they are concentrated in about 1-2% of the DIMMs. In other words, while a typical home user might get a bad stick of RAM and have to replace it, a bad stick of RAM in an ECC system turns into an ongoing stream of (mostly correctable) errors to the tune of thousands per day. What's more, the "bad" DIMMs are also highly correlated with machines, so there are a lot of these errors that are actually marginal motherboards or CPUs that end up getting corrected as well.
From the actual study:
And that's including the bad ones that a home user would simply replace.
So Linus, as seems to be the norm these days, is just wrong. ECC is bad for home users because it's much slower. Like 30% slower. Datacenters use ECC because they have server CPUs with huge caches and eight memory channels that tolerate slow RAM, need all the uptime they can get, and can't afford to spend hours troubleshooting RAM problems. Home users aren't datacenters! Home users can afford to fiddle around swapping DIMMs!
Rowhammer isn't relevant. Sure, the hardware is supposed to always correctly execute legal code, and Rowhammer code is legal code. So are side channel attacks, Meltdown and Spectre, and basically every security threat faced by modern computing. Programmers have finally gotten good at ~~not writing buffer overflows~~ using languages that aren't susceptible to buffer overflows, so the security researchers are getting creative. That's good! Security got better. You still have to do it. DDR4 mitigated Rowhammer, and DDR5 is supposed to mitigate it some more. It's only really a problem for DDR3... and ECC doesn't even prevent it!
(Score: 0) by Anonymous Coward on Wednesday January 06, @06:51AM
Different kind of Marxist here. Find me a healthy free market frist.
(Score: 2) by RamiK on Wednesday January 06, @09:32AM
The merits of ECC in general aren't the point. It was always useful and outright essential for productivity loads. It's why workstations and servers paid a premium for it. But while it was annoying, it was justified since ECC memory involved increased costs across the design for the motherboard (the memory controller hub on the northbridge), CPU and RAM.
But, things changed.
Around 2011 the northbridge was assimilated into the CPU, so there are no longer additional design and validation costs for ECC on the motherboard so long as it's the default. That left the CPU and memory.
Then a couple of years ago AMD designed their memory controllers with ECC support built-in for every model proving it doesn't really cost anything extra to get it done on the CPU / memory hub side of things too.
Finally, and that's where the relevant part of the rant comes in, the most recent memory production nodes ended up so noisy that memory manufacturers are being forced to use ECC internally on their controllers anyhow. They even put it into their standard specs. So, what's happening now is that all the chips (pardon the pun) are in place and there's nothing BoM-wise preventing mass-market ECC adoption. That is, except for Intel's market segmentation...
So, with AMD in the game, it's finally a fight worth fighting over for Linus. But that's only been true for the last couple of years really.
compiling...
(Score: 0) by Anonymous Coward on Wednesday January 06, @10:54AM
Or, perhaps, you could regulate your RAM, according to the Second Amendment, Arik? With barrel regulation, or ECC under Federal Militia Rules?
(Score: 0) by Anonymous Coward on Wednesday January 06, @04:13AM (3 children)
In software, "shit happens" has been the accepted notion. We sent out updates when we were forced to.
Hardware used to be different. But that was like half a century ago. Hardware is as flaky as software, and that's been the case for a few decades now.
(Score: 2) by canopic jug on Wednesday January 06, @05:48AM (2 children)
In ~~software~~ Microsoft products, "shit happens" has been the accepted notion. We sent out updates when we were forced to.
Hardware used to be different. But that was like half a century ago. Hardware is as flaky as software, and that's been the case for a few decades now.
There. Fixed that for you. The problem has not been software but rather Microsoft products. Over time that has translated to a general expectation of bad engineering in software and to an acceptance of bad design everywhere else. With both computer software and hardware that has become an expectation. Everywhere else, it has merely become accepted, much to the threat of our continued survival as a society. Bill Gates' most lasting legacy, if there is a civilization left after a few years, will be that he made bad engineering acceptable.
Money is not free speech. Elections should not be auctions.
(Score: 2) by Immerman on Wednesday January 06, @06:17AM (1 child)
Really? Microsoft is particularly bad, but I can't say that I've ever used flawless software. And I don't think it's just a problem with expectations - software engineering is HARD. A typical car only has around 1800 parts, and cars always have their faults, because engineers aren't perfect.
If you take a single line of code as very roughly comparable in design complexity to a single car part (ranging from a single bolt out to a complex cast manifold), that means a single large scale piece of software can have around 1,000-5,000x as many "parts" as a typical car, and in any sane world you would expect at *least* a similar increase in flaws.
(Score: 0) by Anonymous Coward on Wednesday January 06, @06:22AM
> Microsoft is particularly bad, but I can't say that I've ever used flawless software.
On Linux they call this User Error or WONTFIX.
(Score: 3, Informative) by dltaylor on Wednesday January 06, @04:55AM (2 children)
I have been running ECC since "like, forever". It has caught a few glitches over the years. My soon to be replaced (I hope) desktops are old Dell servers, with much (PSU, disks, graphics) replaced. They have Xeons (X5570) and ECC memory, and everything that I do where bit rot might matter is done on them.
I also have a couple of old Dell laptops with i7s, which I mostly use for web browsing and checking email on the road.
A couple of Ryzens in motherboards supporting ECC are high on the list as a winter purchase. I do not game (or mine bitcoin), so I can re-use the current desktops' graphics. I still have the original cards, so I can put the servers back to their "intended" use.
(Score: 0) by Anonymous Coward on Wednesday January 06, @05:56AM
Beware of Arik- your post, which is pretty similar to mine, will set him off. Otherwise thanks for your excellent post.
(Score: 1, Funny) by Anonymous Coward on Wednesday January 06, @06:24AM
I saved a fortune on ECC RAM by wrapping the sticks in tin foil. No Errs so far.
(Score: 0) by Anonymous Coward on Wednesday January 06, @06:51AM (2 children)
CAN SOMEONE SAY WHAT ecc ram IS OR WAS - THANKS VERY MUCH
(Score: 0) by Anonymous Coward on Wednesday January 06, @07:10AM
Error-correcting code memory
(Score: 2) by dltaylor on Wednesday January 06, @09:53AM
more completely: ECC is a TLA for Error Checking and Correcting
There are extra bits in the data stream between the memory controller (which may be inside the CPU, as in Xeons) and the RAM. The extra bits allow a code to be stored alongside the data and read back when the memory is accessed, which can identify that the data read back is wrong. "Normally" these days (some specialized computers can do more) it allows any single-bit error to be identified and corrected from the code, and double-bit errors to be detected but not corrected. Back in the days of parity memory, all you knew was that the parity was bad, for single-bit errors, but not which bit, so you couldn't fix it, and two flipped bits may have good parity for bad data.
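To make the difference between parity and ECC concrete, here is a toy sketch, not how any real memory controller is implemented. It uses an extended Hamming code on 4 data bits (8 stored bits in total); real ECC DIMMs typically apply the same single-error-correct, double-error-detect idea to 64 data bits plus 8 check bits. All function names are invented for the example.

```python
def parity_ok(bits):
    """Plain even parity: catches one flipped bit, is fooled by two."""
    return sum(bits) % 2 == 0


def encode(nibble):
    """Encode 4 data bits into 8 bits: index 0 is overall parity,
    indices 1..7 are Hamming positions (1, 2, 4 are check bits)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    odd_flips = sum(code) % 2 == 1          # overall parity violated?
    if syndrome == 0 and not odd_flips:
        status = "ok"
    elif odd_flips:
        code[syndrome] ^= 1                 # single flip; syndrome 0 = parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"        # even parity but nonzero syndrome: two flips
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                            # one bit flip
    print(decode(word))                     # (11, 'corrected')
    word[2] ^= 1                            # a second flip in the same word
    print(decode(word))                     # (None, 'uncorrectable')
    print(parity_ok(word))                  # True: plain parity misses the double flip
```

The last line shows the parity problem described above: two flips leave the parity looking good, while the extra check bits still flag the word as bad.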
(Score: 2) by bradley13 on Wednesday January 06, @08:47AM (1 child)
Since Intel has failed to support ECC for consumer machines, it's hard to justify buying ECC for personal use. The workstations that I have seen supporting ECC have been really poorly engineered. I remember a couple of Dell Precisions that sounded like jets waiting for takeoff - not something you really want sitting next to your desk. So I'm not sure just Intel is to blame - the whole PC manufacturing market has played along.
But bit-rot definitely happens. One really obvious example: I once saw a presentation that included a live Excel sheet - and Excel had summed up a column of numbers incorrectly. When I pointed it out, at first no one believed me (computers don't make mistakes!). I persuaded the presenter to reload the sheet, and suddenly the total was different. It does happen, and can have genuine consequences. The performance hit is minor, if you are doing anything of actual importance.
Everyone is somebody else's weirdo.
(Score: 0) by Anonymous Coward on Wednesday January 06, @09:04AM
Maybe he had ECC and it caused the problem? You know like vaccines caused AIDS.
(Score: 2) by Rich on Wednesday January 06, @10:16AM
Wow. That's the old Linus. "...these f*ckers happily sold broken hardware...". Yay for him! Did he forget his medications? Will he be cancelled now???
Anyway, because most RAM sits idle most of the time in a "desktop" setting, I think it is a good idea to run soft ECC. Pages that sit idle for a few seconds will be checksummed and locked. On the next access, or at slow periodic intervals, they are checksummed again before being made accessible. While that will not be perfectly reliable, it will give a good idea of how likely a bit error is on a given machine. Being busy or sitting idle does not have any effect on the reliability of DRAM, save for Rowhammer-like events. It's mostly about where the random cosmic ray hits. I assume at least. And if it is not, statistics would indicate that, so all the better to have them. I think the kernel actually has such features, but it would be up to the distros to package them in an accessible way.
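Roughly what I mean, as a user-space toy rather than anything the kernel actually does: the class name, the page size and the choice of CRC32 are all assumptions for illustration. Note that this only detects a flip in an idle page; unlike real ECC it cannot correct it.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


class SoftCheckedPage:
    """A 'page' that records a checksum while idle and re-checks it on access."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.checksum = None          # None means the page is in use, not sealed

    def seal(self) -> None:
        """Page has gone idle: record a checksum of its contents."""
        self.checksum = zlib.crc32(self.data)

    def verify(self) -> bool:
        """Re-checksum a sealed page; True if unchanged or not sealed."""
        if self.checksum is None:
            return True
        return zlib.crc32(self.data) == self.checksum

    def unseal(self) -> bool:
        """Check the page before handing it back; report whether it was intact."""
        intact = self.verify()
        self.checksum = None
        return intact


if __name__ == "__main__":
    page = SoftCheckedPage(bytes(PAGE_SIZE))
    page.seal()
    page.data[100] ^= 0x04            # simulate a single bit flip while idle
    print(page.unseal())              # False: the flip was detected
```

Counting how often `unseal()` reports a damaged page would give the per-machine statistics mentioned above.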