SoylentNews is people

posted by martyb on Wednesday January 06 2021, @03:27AM
from the bit-flip-out dept.

Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC

There's nothing quite like a fiery mailing list post by Linus Torvalds for some fun holiday-weekend reading. The Linux creator is out with one of his classic messages, this time arguing for the importance of ECC memory and giving his opinion on how Intel's "bad policies" and market segmentation have made ECC memory less widespread.

Linus argues that error-correcting code (ECC) memory "absolutely matters" but that "Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation... Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt ECC. Seriously...The arguments against ECC were always complete and utter garbage... Now even the memory manufacturers are starting [to] do ECC internally because they finally owned up to the fact that they absolutely have to. And the memory manufacturers claim it's because of economics and lower power. And they are lying bastards - let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an "attack", when it always was "we're cutting corners"."

Ian Cutress from AnandTech points out in a reply that AMD's Ryzen ECC support is not as solid as believed.

Related: Linus Torvalds: 'I'm Not a Programmer Anymore'
Linus Torvalds Rejects "Beyond Stupid" Intel Security Patch From Amazon Web Services
Linus Torvalds: Don't Hide Rust in Linux Kernel; Death to AVX-512
Linus Torvalds Doubts Linux will Get Ported to Apple M1 Hardware


Original Submission

Related Stories

Linus Torvalds: 'I'm Not a Programmer Anymore' 30 comments

Submitted via IRC for soylent_blue

Linus Torvalds: 'I'm not a programmer anymore'

Linus Torvalds, Linux's creator, doesn't make speeches anymore. But what he does do, and did again at Open Source Summit Europe in Lyon, France, is have public conversations with his friend Dirk Hohndel, VMware's Chief Open Source Officer. In this keynote discussion, Torvalds revealed that he doesn't think he's a programmer anymore.

So what does the person everyone thinks of as a programmer's programmer do instead? Torvalds explained:

I don't know coding at all anymore. Most of the code I write is in my e-mails. So somebody sends me a patch ... I [reply with] pseudo code. I'm so used to editing patches now I sometimes edit patches and send out the patch without having ever tested it. I literally wrote it in the mail and say, 'I think this is how it should be done,' but this is what I do, I am not a programmer.

So, Hohndel asked, "What is your job?" Torvalds replied, "I read and write a lot of email. My job really is, in the end, is to say 'no.' Somebody has to say 'no' to [this patch or that pull request]. And because developers know that if they do something that I'll say 'no' to, they do a better job of writing the code."

Torvalds continued, "Sometimes the code changes are so obvious that no messages [are] really required, but that is very very rare." To help your code pass muster with Torvalds it helps to "explain why the code does something and why some change is needed because that in turn helps the managerial side of the equation, where if you can explain your code to me, I will trust the code."

In short, these days Torvalds is a code manager and maintainer, not a developer. That's fine with him: "I see one of my primary goals to be very responsive when people send me patches. I want to be like, I say yes or no within a day or two. During a merge, the day or two may stretch into a week, but I want to be there all the time as a maintainer."

That's what code maintainers should do.


Original Submission

Linus Torvalds Rejects "Beyond Stupid" Intel Security Patch From Amazon Web Services 50 comments

Linus Torvalds rejects 'beyond stupid' AWS-made Linux patch for Intel CPU Snoop attack

Linux kernel head Linus Torvalds has trashed a patch from Amazon Web Services (AWS) engineers that was aimed at mitigating the Snoop attack on Intel CPUs discovered by an AWS engineer earlier this year. [...] AWS engineer Pawel Wieczorkiewicz discovered a way to leak data from an Intel CPU's memory via its L1D cache, which sits in CPU cores, through 'bus snooping' – the cache updating operation that happens when data is modified in L1D.

In the wake of the disclosure, AWS engineer Balbir Singh proposed a patch for the Linux kernel for applications to be able to opt in to flush the L1D cache when a task is switched out. [...] The feature would allow applications on an opt-in basis to call prctl(2) to flush the L1D cache for a task once it leaves the CPU, assuming the hardware supports it.

But, as spotted by Phoronix, Torvalds believes the patch will allow applications that opt in to the patch to degrade CPU performance for other applications.

"Because it looks to me like this basically exports cache flushing instructions to user space, and gives processes a way to just say 'slow down anybody else I schedule with too'," wrote Torvalds yesterday. "In other words, from what I can tell, this takes the crazy 'Intel ships buggy CPU's and it causes problems for virtualization' code (which I didn't much care about), and turns it into 'anybody can opt in to this disease, and now it affects even people and CPU's that don't need it and configurations where it's completely pointless'."


Original Submission

Linus Torvalds: Don't Hide Rust in Linux Kernel; Death to AVX-512 50 comments

Linus Torvalds' Initial Comment On Rust Code Prospects Within The Linux Kernel

Kernel developers appear to be eager to debate the merits of potentially allowing Rust code within the Linux kernel. Linus Torvalds himself has made some initial remarks on the topic ahead of the Linux Plumbers 2020 conference where the matter will be discussed at length.

[...] Linus Torvalds chimed in though with his own opinion on the matter. Linus commented that he would like it to be effectively enabled by default to ensure there is widespread testing and not any isolated usage where developers then may do "crazy" things. He isn't calling for Rust to be a requirement for the kernel but rather if the Rust compiler is detected on the system, Kconfig would enable the Rust support and go ahead in building any hypothetical Rust kernel code in order to see it's properly built at least.

Linus Torvalds Wishes Intel's AVX-512 A Painful Death

According to a mailing list post spotted by Phoronix, Linux creator Linus Torvalds has shared his strong views on the AVX-512 instruction set. The discussion arose as a result of recent news that Intel's upcoming Alder Lake processors reportedly lack support for AVX-512.

Torvalds' advice to Intel is to focus on things that matter instead of wasting resources on new instruction sets, like AVX-512, that he feels aren't beneficial outside the HPC market.

Related: Rust 1.0 Finally Released!
Results of Rust Survey 2016
AVX-512: A "Hidden Gem"?
Linus Torvalds Rejects "Beyond Stupid" Intel Security Patch From Amazon Web Services


Original Submission

Linus Torvalds Doubts Linux will Get Ported to Apple M1 Hardware 39 comments

Linus Torvalds doubts Linux will get ported to Apple M1 hardware:

In a recent post on the Real World Technologies forum—one of the few public internet venues Linux founder Linus Torvalds is known to regularly visit—a user named Paul asked Torvalds, "What do you think of the new Apple laptop?"

If you've been living under a rock for the last few weeks, Apple released new versions of the Macbook Air, Macbook Pro, and Mac Mini featuring a brand-new processor—the Apple M1.

The M1 processor is a successor to the A12 and A14 Bionic CPUs used in iPhones and iPads, and pairs the battery and thermal efficiency of ultramobile designs with the high performance needed to compete strongly in the laptop and desktop world.

"I'd absolutely love to have one, if it just ran Linux," Torvalds replied. "I've been waiting for an ARM laptop that can run Linux for a long time. The new [Macbook] Air would be almost perfect, except for the OS."

[...] In an interview with ZDNet, Torvalds expounded on the problem:

The main problem with the M1 for me is the GPU and other devices around it, because that's likely what would hold me off using it because it wouldn't have any Linux support unless Apple opens up... [that] seems unlikely, but hey, you can always hope.

[...] It's also worth noting that while the M1 is unabashedly great, it's not the final word in desktop or laptop System on Chip designs. Torvalds mentions that, given a choice, he'd prefer more and higher-power cores—which is certainly possible and seems a likely request to be granted soon.

Previously: Apple's New ARM-Based Macs Won't Support Windows Through Boot Camp
Apple Claims that its M1 SoC for ARM-Based Macs Uses the World's Fastest CPU Core
Your New Apple Computer Isn't Yours


Original Submission

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Interesting) by Arik on Wednesday January 06 2021, @03:34AM (67 children)

    by Arik (4543) on Wednesday January 06 2021, @03:34AM (#1095450) Journal
    I've been ranting about this for more than 2 decades. It's absurd. You can manufacture something a sane, knowledgeable person would buy.

    Or you can save a few fractions of a cent per unit, make garbage, and blow smoke up the arse of the potential purchasing public. And then stiff the minority of purchasers that don't want garbage with a ridiculous premium upcharge for something that should be standard. If you even think their orders are worth your time, at said ridiculous premium, which you probably don't.

    Guess which choice absolutely every manufacturer went to in short order?

    In a healthy market this sort of scam has a very short lifespan. They've been doing this to us for half a century now. This is not a healthy market. Change my mind.
    --
    If laughter is the best medicine, who are the best doctors?
    • (Score: 5, Insightful) by sjames on Wednesday January 06 2021, @03:53AM

      by sjames (2882) on Wednesday January 06 2021, @03:53AM (#1095453) Journal

      That's the funny thing with markets in our economy. Most of them are unhealthy. The evidence is all around us highlighted in flashing neon.

    • (Score: 4, Informative) by fustakrakich on Wednesday January 06 2021, @04:25AM (4 children)

      by fustakrakich (6150) on Wednesday January 06 2021, @04:25AM (#1095461) Journal

      Story of our lives. They cut corners everywhere. It's always a coldly calculated risk [wfu.edu]. Our *pillars of society* are just as crooked as the average heroine dealer that cuts his product with Drano

      --
      La politica e i criminali sono la stessa cosa..
      • (Score: 1) by fustakrakich on Wednesday January 06 2021, @04:27AM

        by fustakrakich (6150) on Wednesday January 06 2021, @04:27AM (#1095464) Journal

        A bit Freudian, eh? Too bad spell check doesn't do context...

        --
        La politica e i criminali sono la stessa cosa..
      • (Score: 1, Funny) by Anonymous Coward on Wednesday January 06 2021, @04:28AM (1 child)

        by Anonymous Coward on Wednesday January 06 2021, @04:28AM (#1095465)

        are just as crooked as the average heroine dealer that cuts his product with Drano

        Uh, interesting analogy. Spoken from direct experience? :)

      • (Score: 3, Funny) by Azuma Hazuki on Wednesday January 06 2021, @12:45PM

        by Azuma Hazuki (5086) on Wednesday January 06 2021, @12:45PM (#1095585) Journal

        I scrub with Ajax you insensitive clod :D

        --
        I am "that girl" your mother warned you about...
    • (Score: 3, Interesting) by RS3 on Wednesday January 06 2021, @04:26AM (25 children)

      by RS3 (6367) on Wednesday January 06 2021, @04:26AM (#1095463)

      Somewhere else someone posted some gaming benchmarks showing ECC was noticeably slower. So gamers, don't use ECC.

      I know several people who use Xeon-based "workstations", which have ECC RAM, as their main computers. So there's that.

      And AMD support ECC more than Intel.

      Someone pointed out that few laptops have ECC support.

      Years ago RAM wasn't so reliable. Slowly it's gotten better, and parity RAM pretty much phased out.

      I've had almost no RAM problems in more than 20 years. A couple of crap brand sticks that were bad in machines I was given (or trash-picked) but I don't think I've ever had something crash or any kind of indication of a flipped bit in any other machines. I run MemTest86 from time to time just to check, and no problems.

      But if you're that worried, go with Xeon + ECC.

      • (Score: 3, Insightful) by Arik on Wednesday January 06 2021, @04:42AM (11 children)

        by Arik (4543) on Wednesday January 06 2021, @04:42AM (#1095476) Journal
        "Somewhere else someone posted some gaming benchmarks showing ECC was noticeably slower. So gamers, don't use ECC."

        No, stupid gamers don't use ECC.

        Honestly, if this is your level of understanding, there's no point in trying to have a conversation. Ridicule alone is appropriate.

        "Years ago RAM wasn't so reliable. Slowly it's gotten better, and parity RAM pretty much phased out."

        The minor theoretical improvements in RAM reliability are more than offset by increased RAM density. You've got it bass-ackwards, in other words.

        "I've had almost no RAM problems in more than 20 years."

        That you correctly diagnosed.

        "I run MemTest86 from time to time just to check, and no problems."

        Oh? Obviously I was wrong, you're a genius, that's the gold standard right there. If memtest86 from time to time didn't diagnose a memory issue, then clearly you never had one - and if you never had one, then no one did. All in my mind.

        "But if you're that worried, go with Xeon + ECC."

        Yes, we're talking about the premium and less than certain availability of that choice.

        --
        If laughter is the best medicine, who are the best doctors?
        • (Score: 4, Touché) by RS3 on Wednesday January 06 2021, @05:53AM (10 children)

          by RS3 (6367) on Wednesday January 06 2021, @05:53AM (#1095504)

          Dude, what's your problem? I used to consider you a friend.

          Why does everyone take a post as an absolute statement? In my real life, conversations evolve, interactively. Sorry if I didn't read your mind nor measure up to your standards of what the eff I'm supposed to post here.

          Not sure what's wrong with you Arik but I truly hope you find some peace and happiness somewhere. Insults and attack me about effing RAM? I'm truly sorry I tried to contribute. I wish I could delete my posts.

          • (Score: 0, Troll) by Anonymous Coward on Wednesday January 06 2021, @06:17AM (1 child)

            by Anonymous Coward on Wednesday January 06 2021, @06:17AM (#1095509)

            I'll be your friend! Tell me what you want me to say.

            • (Score: 2) by RS3 on Wednesday January 06 2021, @06:38AM

              by RS3 (6367) on Wednesday January 06 2021, @06:38AM (#1095520)

              That Arik forgot to take his meds, but will take them and be better tomorrow.

              Not sure if you meant to be funny, but thank you, sincerely, for a good laugh!

          • (Score: 2) by Arik on Wednesday January 06 2021, @09:59PM (6 children)

            by Arik (4543) on Wednesday January 06 2021, @09:59PM (#1095844) Journal
            "Dude, what's your problem? I used to consider you a friend."

            Dude, what's your problem? I can no longer be your friend because I disagreed with something you posted? Really?

            "Why does everyone take a post as an absolute statement?"

            If that's how it's written, then how else would you expect anyone to take it? That's one of the reasons conversations go back and forth, isn't it?

            "In my real life, conversations evolve, interactively."

            Exactly.

            "Sorry if I didn't read your mind nor measure up to your standards of what the eff I'm supposed to post here."

            I don't even know what you mean, but I'm truly sorry if I offended you. I thought I was just talking about RAM.
            --
            If laughter is the best medicine, who are the best doctors?
            • (Score: 0) by Anonymous Coward on Thursday January 07 2021, @06:28AM (5 children)

              by Anonymous Coward on Thursday January 07 2021, @06:28AM (#1096315)

              You changed the subject to "Stupid. Stupid. Stupid." and your post was dripping with derision. You had to know what you were doing.

              • (Score: 2) by Arik on Thursday January 07 2021, @09:14AM (4 children)

                by Arik (4543) on Thursday January 07 2021, @09:14AM (#1096379) Journal
                I was pointing out the comment was stupid. It wasn't the first dumb comment to be posted, far from it. Happens to everyone, sooner or later. I've made a few myself. Participating in free and open debate and discussion with adults means accepting the possibility someone may criticize your posting vigorously. This isn't kindergarten and it's not a safe space for people that can't take criticism. You can have vigorous, adult discussions, or you can have a safe space for people who are easily offended. You can't have both.
                --
                If laughter is the best medicine, who are the best doctors?
                • (Score: 0) by Anonymous Coward on Thursday January 07 2021, @10:34AM (3 children)

                  by Anonymous Coward on Thursday January 07 2021, @10:34AM (#1096394)

                  I'm not telling you to change your tone. You can be as big of an asshole as you can live with and then some. I'm just telling you that you and everyone else knows exactly what you did and you should not act surprised and feign regret when you get called out for it. No one is falling for it and definitely not after that double down.

                  • (Score: 2) by Arik on Friday January 08 2021, @10:40AM (2 children)

                    by Arik (4543) on Friday January 08 2021, @10:40AM (#1096943) Journal
                    "I'm not telling you to change your tone."

                    No, no, that's exactly what you're doing. And you should own it.

                    The one thing wrong with my post was tone. Wasn't the best, would have obviously been better received if I had sugar coated it. If I'd had a better day earlier I probably would have.

                    That's all the regret you'll get out of me on this. I'm not here to make friends, I don't believe in online friends even if I had the time. Had enough of that. I'm a voice in the wilderness and I'm crying from compulsion. You're not meant to be my friend, you're not meant to know who I am. You're meant to hear my words, if they are meant for you; and if not then to wonder why I howl for a fleeting instant, before shrugging your shoulders and going back to your life.

                    Since the "friend" crap started showing up on websites over 20 years ago I went through the wtf stage and got to the cannibalize stage. I mark an online "friend" in order to give a karma bonus to people that have demonstrated the ability to make interesting posts, in order to help ensure I see them in the future.

                    Interesting posts - not necessarily ones I agree with. Not nice posts, not even necessarily diplomatic posts (though making those is a skill that fascinates me as it is so alien and does draw extra attention from me) but telling posts. Ones that cut right through the distractions and strike the root.

                    They're a minority of posts, no one hits that goal every time /and I don't want anyone to think they need to./

                    If you post absolute garbage 9 times out of 10, and really make me think that last 1, I'll "friend" you and keep you there. I might reply to the garbage posts, call them garbage posts, might get you so upset you "unfriend" me and so on... worst case.

                    Don't care, doesn't matter. It's not a relationship. It's a flag in a database.

                    This is not social media.
                    --
                    If laughter is the best medicine, who are the best doctors?
                    • (Score: 0) by Anonymous Coward on Saturday January 09 2021, @12:37AM (1 child)

                      by Anonymous Coward on Saturday January 09 2021, @12:37AM (#1097224)

                      Still pretending. So sad. But not as sad as if you weren't. Either you regret it or you don't. But don't pretend like you do and don't at the same time. Either own it or don't.

                      P.S. As an aside, you may want to look up the definition of "social media."

                      • (Score: 1) by Arik on Tuesday January 12 2021, @01:05AM

                        by Arik (4543) on Tuesday January 12 2021, @01:05AM (#1098701) Journal
                        "Either you regret it or you don't."

                        Aristotelian nonsense. "A foolish consistency is the hobgoblin of little minds."

                        --
                        If laughter is the best medicine, who are the best doctors?
          • (Score: 0) by Anonymous Coward on Thursday January 07 2021, @03:15PM

            by Anonymous Coward on Thursday January 07 2021, @03:15PM (#1096467)

            C'mon, it's pretty obvious that Arik is a total asshat. The fact that he insists on posting everything in Arik Mono is evidence enough.

      • (Score: 5, Informative) by sjames on Wednesday January 06 2021, @07:45AM (10 children)

        by sjames (2882) on Wednesday January 06 2021, @07:45AM (#1095534) Journal

        You mean you had no problems that you are aware of. I maintain a number of machines with ECC used for simulations. They run just fine, but once in a blue moon, one of them will log a corrected memory error. You could run memtest daily and never happen to catch an error. Memtest is designed to catch failing hardware, not the occasional random bit flip. You'd have to continuously run memtest for months to actually catch that sort of error.

        • (Score: 2, Interesting) by Anonymous Coward on Wednesday January 06 2021, @10:53AM

          by Anonymous Coward on Wednesday January 06 2021, @10:53AM (#1095565)

          IIRC Google testing showed that DRAM can expect 1 bit flip per gigabyte per month due to background radiation, regardless of brand or type.

        • (Score: 2) by RS3 on Wednesday January 06 2021, @03:22PM (8 children)

          by RS3 (6367) on Wednesday January 06 2021, @03:22PM (#1095638)

          Yes, I know about hardware, memtest, bit-flips, etc., but thanks for the general info.

          They run just fine, but once in a blue moon, one of them will log a corrected memory error.

          Which OS, and what software is logging, and where are the logs?

          I maintain both Linux and Windows servers and I've never seen a logged RAM error, so I'm wondering if I'm missing something- missing a log somewhere that I don't know about...

          • (Score: 3, Informative) by sjames on Wednesday January 06 2021, @07:00PM (3 children)

            by sjames (2882) on Wednesday January 06 2021, @07:00PM (#1095717) Journal

            CentOS. It shows up in /var/log/messages and on the console. Debian puts it in syslog. It can also be found in the system event log. I have seen it on Cisco and on Supermicro hardware. I know that not all MBs support reporting ECC issues to the kernel.

            • (Score: 2) by RS3 on Wednesday January 06 2021, @07:53PM (1 child)

              by RS3 (6367) on Wednesday January 06 2021, @07:53PM (#1095750)

              Thank you for that. [shuffles off to check logs...]

              Some CentOS here too. I check /var/log/messages, "dmesg", daily "logwatch" email, lots of goodies in /sys and /proc, but I've never seen a RAM error message. Obviously that doesn't mean it never happens, just that I've never seen it. Haven't been running the Open Manage software, but it might help. Had too many problems getting it to work well. Well, I do run arcconf but not the rest of it.

              • (Score: -1, Troll) by Anonymous Coward on Wednesday January 06 2021, @08:12PM

                by Anonymous Coward on Wednesday January 06 2021, @08:12PM (#1095764)

                and you're not running with ECC ram. he said it logged a correction. are you daft?

            • (Score: 2) by RS3 on Wednesday January 06 2021, @08:08PM

              by RS3 (6367) on Wednesday January 06 2021, @08:08PM (#1095761)

              Sorry- hit "submit" too soon (as I do too often...)

              Yeah, it may be a matter of MB driver support. Dell has fairly good Open Manage modules for some of their hardware, but I don't know the extent. I run it on the Windows servers, but was having SW problems with one so I disabled it... [shuffles off again...]

              Well, I ran the Dell Open Manage, checked the logs, and there was 1 ECC bit correction in September, and no other ECC or RAM messages in a year. I can live with that. :)

              I might look into running the Open Manage on the Linux servers. I don't run them in X mode, but have Windows workstations I remote into that I can display the output on (Cygwin/X, etc.)

          • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @08:30PM (3 children)

            by Anonymous Coward on Wednesday January 06 2021, @08:30PM (#1095774)

            You have to enable Machine Check Exceptions (MCE) in your BIOS. Then you should see it in the system logs (Linux). On Debian 10 you need to install "collectd-core" iirc, don't know about other distributions. On Windows I presume the MCE's will show up in the logs as well but I don't have any (important) Windows boxes to check.
            One nice link I have in my bookmarks is http://mindofjim.blogspot.com/2010/01/i-see-see-ecc.html [blogspot.com]

            • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @09:27PM

              by Anonymous Coward on Wednesday January 06 2021, @09:27PM (#1095816)

              That package depends on the actual daemon for reporting it. Instead, you can use them directly with mcelog for older kernels and rasdaemon for newer kernels.

            • (Score: 2) by RS3 on Thursday January 07 2021, @05:43AM (1 child)

              by RS3 (6367) on Thursday January 07 2021, @05:43AM (#1096286)

              Thank you for all the good info. Actually there's no such setting in the BIOS (I checked), and I don't think I've ever seen that in a BIOS, but I'll keep an eye out for it. I'm one to get into BIOS on any machine I touch, and at least check settings, etc. Is MCE on by default if there's no MCE setting in the BIOS?

              • (Score: 0) by Anonymous Coward on Thursday January 07 2021, @10:37AM

                by Anonymous Coward on Thursday January 07 2021, @10:37AM (#1096395)

                Most have a toggle but some are always on and some are always off. The only way to know for sure is to check your MCE or RAS status in the sysfs after installing the correct daemon for your kernel.
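As a rough sketch of what checking those counters in sysfs can look like: on Linux systems where an EDAC driver is loaded, per-memory-controller corrected and uncorrected error counts are typically exposed as `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/`. The helper name `read_edac_counts` is made up for illustration, and the directory may be missing entirely if no EDAC driver supports the board, so an empty result should be read as "not reported" rather than "no errors".

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    ce_count = corrected errors, ue_count = uncorrected errors.
    The default root is the usual EDAC sysfs location; whether it exists
    depends on the kernel, driver, and motherboard (an assumption here).
    """
    base = Path(root)
    if not base.is_dir():
        return {}  # no EDAC driver loaded, or not a Linux system

    counts = {}
    for mc_dir in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # file missing or unreadable
        counts[mc_dir.name] = entry
    return counts
```

Calling `read_edac_counts()` with no arguments reads the live system; passing a different `root` makes it easy to try against a copied directory tree.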

      • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @01:48PM (1 child)

        by Anonymous Coward on Wednesday January 06 2021, @01:48PM (#1095605)

        But if you're that worried, go with Xeon + ECC.

        I think I'll go with AMD + ECC, thank you very much. No need to pay premium for chips that don't have something disabled on purpose.

        • (Score: 2) by RS3 on Wednesday January 06 2021, @03:42PM

          by RS3 (6367) on Wednesday January 06 2021, @03:42PM (#1095644)

          I think I'll go with AMD + ECC, thank you very much. No need to pay premium for chips that don't have something disabled on purpose.

          I think there's a misunderstanding. Yes, I'd go with AMD no question. I'm NO Intel shill, at all. I was thinking more along the lines of very very cheaply available USED business-class servers and workstations. Most corporations depreciate otherwise great hardware rather quickly, and it's on the used market- ebay, craigslist, etc.

          If you can afford new, by all means have at it. AMD would be my choice if I could afford new.

          I don't think Intel "disabled" ECC "on purpose". In the past, there were large chips (often called "chipset" because it used to be many chips), that interconnect the CPU, RAM, PCI and other internal busses, and the various IO. The chipset handled ECC logic functionality. More and more Intel and AMD have been pulling the chipset functions into the CPU, and Intel simply didn't include the ECC stuff. I don't think it was disabled, but rather left out. And I'm not disagreeing with anyone- it should be included. It frustrates me to no end that useful functionality is designed out of things to save costs (because much of the market doesn't care?) Sadly companies don't do things because it's the right thing to do. All of this seems obvious, to me, but in these online conversations it's like people ignore basic knowledge of capitalism, social values, etc. Point is, I didn't think I had to write all of those disclaimers when it seems like common knowledge.

          Anyway, all that said, ECC logic could still be done, between the CPU and RAM, by any motherboard / system designer that wishes to.

          Also, somewhere I read that because ECC is being supported less and less, that RAM manufacturers are including it in the RAM chips themselves. You may not be able to get at the workings to get stats, but I don't know. It may be available on the i2c bus.

    • (Score: 4, Informative) by Immerman on Wednesday January 06 2021, @06:03AM (23 children)

      by Immerman (3985) on Wednesday January 06 2021, @06:03AM (#1095507)

      Hear, hear. As the quantity of RAM increases, the overall error rate goes up too - in 2009 Google published a study based on their servers that determined the error rate was 1 bit error per gigabyte of RAM per 1.8 hours https://en.wikipedia.org/wiki/ECC_memory#Research [wikipedia.org]

      Assuming reliability is still about the same, that means a typical 16GB computer can expect 71 bit errors over the course of a typical 8-hour day.
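The arithmetic behind that estimate can be spelled out as a small sketch. It simply takes the rate quoted above (one bit error per gigabyte per 1.8 hours) at face value and scales it linearly with RAM size and time; the function name is made up for illustration, and real-world rates vary widely between machines.

```python
def expected_bit_errors(ram_gb, hours, hours_per_error_per_gb=1.8):
    """Expected bit errors, assuming the rate scales linearly with RAM and time."""
    return ram_gb * hours / hours_per_error_per_gb


# 16 GB of RAM over an 8-hour day, at the rate quoted in the comment above.
errors_per_day = expected_bit_errors(ram_gb=16, hours=8)

# Average gap between errors, in minutes, over that same 8-hour day.
minutes_between_errors = 8 * 60 / errors_per_day

print(f"{errors_per_day:.1f} errors, one every {minutes_between_errors:.2f} minutes")
```

Under these assumptions the result is about 71 errors, or one roughly every 6.75 minutes.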

      • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @06:18AM (17 children)

        by Anonymous Coward on Wednesday January 06 2021, @06:18AM (#1095511)

        And what the fuck does that mean for end-users?

        I can just hear the Best Buy rep telling me about 71 bit errors per 1.8 hours.

        • (Score: 4, Informative) by Immerman on Wednesday January 06 2021, @06:49AM (1 child)

          by Immerman (3985) on Wednesday January 06 2021, @06:49AM (#1095525)

          That depends entirely on what you're doing with that RAM.

          If you're playing video games - probably nothing much - slight change in the color of one pixel on a texture somewhere, or a bit of a health change, or something warps through geometry as their position changes. Nothing much compared to all the bugs.

          If you've got a huge database or spreadsheet open - congratulations, every seven minutes or so, on average, another piece of data or formatting gets silently corrupted.

          And if the error is in the RAM containing the machine code of your program itself.... well then who knows? Almost anything could happen - the software is corrupted, and will no longer work as intended... maybe the corruption is in an infrequently used function that never gets used before you close it down - then nothing happens. Or maybe it's in a core loop of your program, or even operating system, in which case maybe it crashes, or maybe corrupts whatever data it touches - it's kind of like invoking undefined behavior in a programming language - maybe nothing happens, maybe the computer calls Halt and Catch Fire, or anything in between - you just don't know until it happens.
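To make "it depends which bit gets hit" concrete, here is a minimal sketch that flips a single bit in an in-memory value and shows how different the outcome can be. The helper names are invented for illustration, and the 64-bit little-endian IEEE 754 layout used for the float is an assumption about how the value is stored.

```python
import struct


def flip_bit(data, bit_index):
    """Return a copy of `data` (bytes) with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def flip_bit_in_double(value, bit_index):
    """Flip one bit of a 64-bit float; bit 0 is the lowest mantissa bit."""
    raw = struct.pack("<d", value)  # little-endian IEEE 754 double
    return struct.unpack("<d", flip_bit(raw, bit_index))[0]


# The same starting value, three different single-bit flips.
tiny_change = flip_bit_in_double(1.0, 0)    # lowest mantissa bit
sign_change = flip_bit_in_double(1.0, 63)   # sign bit
huge_change = flip_bit_in_double(1.0, 62)   # top exponent bit

print(tiny_change, sign_change, huge_change)
```

A flip in the low mantissa bit is barely visible, a flip in the sign bit negates the number, and a flip in the top exponent bit turns 1.0 into infinity.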

          • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @10:59AM

            by Anonymous Coward on Wednesday January 06 2021, @10:59AM (#1095568)

            It's worse than that. A single bit flipped at the OS level can mean a crash if you're lucky or a corrupted hard drive if you're not.

        • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @11:34AM (14 children)

          by Anonymous Coward on Wednesday January 06 2021, @11:34AM (#1095572)

          It all depends on what gets hit. A single flipped bit in the wrong place can mean a corrupted filesystem and lost data.

          • (Score: 2) by barbara hudson on Wednesday January 06 2021, @02:27PM (13 children)

            by barbara hudson (6443) <barbara.Jane.hudson@icloud.com> on Wednesday January 06 2021, @02:27PM (#1095616) Journal

            Your desktop has a solid metal case? Problem solved for external gamma rays. Any bit flipping will be from the noisy EM environment inside the case, not gamma radiation.

            Now if you're a smoker, you're probably the source of decay particle radiation. Those broad tobacco leaves have been sitting out in the field collecting fallout as they grow, and your lungs are concentrating the particles. It's why, at the beginning of the 20th century lung cancer was so rare that if a case presented in a hospital, doctors on rounds would be told "look carefully, you'll probably never see another case in your career."

            That all changed with atmospheric nuclear bomb testing. The fallout from fallout will be with us for another 100 years.

            --
            SoylentNews is social media. Says so right in the slogan. Soylentnews is people, not tech.
            • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @03:41PM (12 children)

              by Anonymous Coward on Wednesday January 06 2021, @03:41PM (#1095643)

              The issue is cosmic rays and this was discovered before cases got to be so cheap.

              • (Score: 2) by RS3 on Wednesday January 06 2021, @03:54PM (11 children)

                by RS3 (6367) on Wednesday January 06 2021, @03:54PM (#1095654)

                Yes, and AFAIK "cosmic rays" (high energy gamma) go through things pretty easily, including steel cases. You need very thick steel, lead, concrete, to significantly reduce them.

                Although cosmic ray / random bit-flips are a problem, I think the "rowhammer" problem is what Linus was more unhappy about (and I with him).

                • (Score: 2) by barbara hudson on Wednesday January 06 2021, @06:54PM (1 child)

                  by barbara hudson (6443) <barbara.Jane.hudson@icloud.com> on Wednesday January 06 2021, @06:54PM (#1095712) Journal
                  There's more gamma radiation at the top of a house than the ground floor. Just an extra floor of air will soak up a few gamma particles. More so at the top of a skyscraper than the middle of an open parking lot. A jet at altitude? Even more so.

                  A steel case will do a better job than moving from your top floor to your basement with a plastic case ever will. Of course, in the basement you also have problems with radon and radiation from concrete and gravel construction materials. Even granite countertops.

                  Linus flipped a bit. It's not like the CPU, the various controllers, even the one in your keyboard, are immune. It's really a non-issue for most people, so why should everyone pay extra for stuff they don't need that will reduce performance? Getting a bit spoiled in his dotage. If it's important to him, let him pay the extra cost and take the performance hit. He's sounding more like RMS every day - his way is the best way and we should all just do as he says.

                  --
                  SoylentNews is social media. Says so right in the slogan. Soylentnews is people, not tech.
                  • (Score: 2) by RS3 on Wednesday January 06 2021, @07:12PM

                    by RS3 (6367) on Wednesday January 06 2021, @07:12PM (#1095723)

                    All excellent points, thanks. Yes, I'm aware that for example, airline workers are at greater risk for radiation-induced health issues.

                    My concern, and what I think I'm gathering from Linus' rant, is the shift away from ECC. I think he's bothered by the lack of choice, and that choice being taken away for economic (cheapening) reasons, rather than solid technical / quality improvement reasons. As an engineer, the #1 thing I'm constantly told to put on my resume is how I cheapened something. The good news is that Intel is losing market share.

                    But I think another factor, at least in my professional life, is that the people who make the purchasing decisions are generally not very technically-minded, and don't generally ask for technical advice from me or colleagues. They make "business" decisions about the hardware and software, OSes, development tools, etc., and foist it on us underlings. Specifically, when offered the option of ECC RAM, they simply say "oh, it costs more and gives indeterminate benefit? No thanks." So Intel starts cutting it out of their products. Hopefully it hurts them more and more.

                    But the point is, it's not just Intel, but (idiot) buyers who are driving ECC down.

                    All that said, as I posted elsewhere in this discussion, there are RAM chip manufacturers making chips with ECC built in. :) Not sure if ECC stats can be gleaned out though- maybe, could be...

                    https://www.intelligentmemory.com/ECC-DRAM/DDR3/ [intelligentmemory.com]

                • (Score: 3, Informative) by Immerman on Thursday January 07 2021, @05:36AM (8 children)

                  by Immerman (3985) on Thursday January 07 2021, @05:36AM (#1096280)

                  FYI cosmic rays are not high energy gamma radiation (electromagnetic radiation), instead they're protons and other atomic nuclei traveling at almost light speed. https://en.wikipedia.org/wiki/Cosmic_ray [wikipedia.org]

                  • (Score: 2) by RS3 on Thursday January 07 2021, @05:53AM (7 children)

                    by RS3 (6367) on Thursday January 07 2021, @05:53AM (#1096295)

                    Thank you for the correction. I'm a bit confused as to how cosmic rays, being alpha particles? can penetrate things and mess with electronics? I thought only gamma could pass through objects... I guess it's their extreme energy? I see that cosmic rays were originally thought to be gamma, so maybe my brain is in 1920s, like my Model A Ford. :)

                    • (Score: 3, Informative) by Immerman on Thursday January 07 2021, @04:46PM (6 children)

                      by Immerman (3985) on Thursday January 07 2021, @04:46PM (#1096516)

                      Yeah, seems like there's always another old misunderstanding to relearn, doesn't it? I suppose it keeps things from getting boring.

                      There's also neutron radiation that can pass through objects since they don't interact with the electron field - but neutrons are so unstable that they decay into hydrogen long before they cross the interstellar void - even at cosmic ray speeds.

                      Yeah, I think it's all about the energy, aka speed. Cosmic rays are traveling at particle-collider speeds, the really high energy ones absolutely dwarf anything we can produce in the LHC. They're just moving far too fast for the deflective force of electron clouds to have much effect before they're past. Kind of like comparing how far gravity deflects a softball pitch over the course of a single meter, versus how far it deflects a high-speed bullet.

                      As I understand it, high energy cosmic rays can actually have far more penetrating power than neutron radiation. I think that's probably down to speed again - QM is all probabilistic, so go fast enough and I think you can pass directly through the nucleus itself and come out the other side before the probability of interacting reaches 100%. Or maybe it's that the powerful repulsive effect of the nucleus on a solitary proton is enough to deflect a lot of would-be hits to near misses - a nucleus is an infinitesimally tiny target after all, around one 26,000th the diameter of an atom, or almost a 700 millionth of the cross-sectional area. Possibly it's a combination of both.
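The step from diameter to cross-sectional area is just squaring the ratio. A quick sketch, taking the one-26,000th diameter figure from the post at face value (the function name is made up for illustration):

```python
def area_ratio(diameter_ratio: float) -> float:
    """Cross-sectional area scales with the square of the diameter."""
    return diameter_ratio ** 2


# Figure from the post: the nucleus is about 1/26,000th the diameter of the atom.
ratio = area_ratio(26_000)
print(f"nucleus covers about 1/{ratio:,.0f} of the atom's cross-section")
```

That gives 1 part in 676,000,000, which is where "almost a 700 millionth" comes from.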

                      • (Score: 2) by RS3 on Thursday January 07 2021, @04:55PM (5 children)

                        by RS3 (6367) on Thursday January 07 2021, @04:55PM (#1096525)

                        Wow, thanks for all of that. Are you a physicist?

                        I've always been a science enthusiast, but I admit I don't delve in deeply, so I end up with "dangerous" knowledge. For instance, I know a little about nuclear reactors, but I didn't know that neutrons decay into hydrogen. Not sure how I missed that along the way. But it explains the problem with hydrogen in nuclear reactors. That aside, it kind of blows my mind. What a bizarre thing a neutron is! What the heck are we even made of anyway? :)

                        • (Score: 2) by Immerman on Thursday January 07 2021, @05:50PM (4 children)

                          by Immerman (3985) on Thursday January 07 2021, @05:50PM (#1096544)

                          Nah, just a fellow science enthusiast - but I have a tendency to delve too deep for my own good sometimes, especially around physics. Probably still lots of "dangerous" knowledge mixed in, just a lot deeper into the details than usual.

                          I don't think neutron decay has anything to do with the hydrogen problem in reactors - I think that's more a matter of superheated water coolant chemically decomposing into H2 and O2. The amount of fuel in a reactor is tiny, and thus the maximum amount of hydrogen that could be produced through neutron decay is similarly tiny. Plus, virtually all emitted neutrons are absorbed into other nuclei, either triggering fission, or transmuting shielding, etc. into heavier isotopes. Free neutrons have a mean lifetime of about 15 minutes (a half-life of roughly 10) - any neutron ejected from a reactor that didn't interact with anything for 15 minutes... would probably have already traveled a long way away from the power plant.
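To put rough numbers on "a long way away", here is a small sketch. The inputs are assumptions, not figures from the post: a free-neutron mean lifetime of about 880 seconds (commonly quoted, equivalent to a half-life of roughly 610 seconds) and a thermal neutron speed of about 2,200 m/s. Fast neutrons fresh out of fission move thousands of times faster.

```python
import math

MEAN_LIFETIME_S = 880.0      # assumed free-neutron mean lifetime, in seconds
THERMAL_SPEED_M_S = 2200.0   # assumed speed of a thermal (slowed-down) neutron


def surviving_fraction(elapsed_s: float, mean_lifetime_s: float = MEAN_LIFETIME_S) -> float:
    """Fraction of free neutrons that have not yet decayed after elapsed_s."""
    return math.exp(-elapsed_s / mean_lifetime_s)


def distance_km(speed_m_s: float, elapsed_s: float) -> float:
    """Straight-line distance covered with no interactions along the way."""
    return speed_m_s * elapsed_s / 1000.0


fifteen_minutes = 15 * 60
print(f"half-life: {MEAN_LIFETIME_S * math.log(2):.0f} s")
print(f"fraction left after 15 min: {surviving_fraction(fifteen_minutes):.2f}")
print(f"distance at thermal speed: {distance_km(THERMAL_SPEED_M_S, fifteen_minutes):.0f} km")
```

Under those assumptions even a slow thermal neutron would cover close to 2,000 km in 15 minutes, so any hydrogen it turned into would not be anywhere near the reactor.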

                          Basically though, beta radiation (an electron) is the result of a neutron within the nucleus decaying into a proton plus electron, with the electron carrying away the excess decay energy in the form of kinetic energy.

                          • (Score: 2) by RS3 on Thursday January 07 2021, @06:17PM (3 children)

                            by RS3 (6367) on Thursday January 07 2021, @06:17PM (#1096559)

                            Good for you! I do delve into some things, but tend toward the more tactile- things I can take apart by myself with a few tools. :) Notice I didn't say anything about putting them back together again...

                            I was thinking the same thing- the neutron flux in a reactor wouldn't be enough to create significant hydrogen.

                            I never really studied neutrons. For whatever reason I had it in my head that neutrons were very stable things, like little lead balls.

                            I do know that neutrons must be moderated (whatever that really means) in a reactor- if they're too fast, reaction doesn't happen.

                            Here's a ponderment: do neutrons ever collide with their own electrons when trying to escape?

                            Another: do the neutrons and protons in a nucleus stay in a stable fixed relative position, or are they moving around and churning?

                            • (Score: 3, Informative) by Immerman on Friday January 08 2021, @01:23AM (2 children)

                              by Immerman (3985) on Friday January 08 2021, @01:23AM (#1096784)

                              >For whatever reason I had it in my head that neutrons were very stable things, like little lead balls.
                              I mean protons are, why not neutrons, right? I mean, you hear "bits of an atom", you figure that's as far as it goes, right? Then you get into quarks and quantum chromodynamics... and everything you thought you knew goes weird.

                              I think (don't quote me) that moderation is basically a matter of putting a bunch of nuclei in the way (typically graphite or water molecules) that the neutrons will collide with without reacting. Each collision transfers some of the neutron's kinetic energy to the target, slowing the neutron to thermal speeds so that it can more easily react with more receptive nuclei.

                              >do neutrons ever collide with their own electrons when trying to escape?
                              I don't think that question makes sense. So long as a Neutron is a neutron, it doesn't have an electron - at best it'd be like asking if you ever collide with your own bones when trying to run. Except I don't think a neutron contains an electron either. It sounds like a neutron decays into a proton and W- boson (???), and the W- boson then transforms into an electron and electron antineutrino.

                              Particle physics is weird - properties like spin and charge (including color charge and other subatomic weirdness) are conserved, but particles themselves can convert to and from raw energy and pop in and out of existence provided enough energy is available (or not - virtual particles spawn out of nothing, along with their antiparticle, by "borrowing" energy from the quantum vacuum, which is "repaid" when they collide and annihilate shortly thereafter.)

                              >do the neutrons and protons in a nucleus stay in a stable fixed relative position, or are they moving around and churning?
                              That's a good question. My gut feeling is that there'd be some churn, but I suspect it'll be a VERY long time before we can actually measure changes small enough to have any real clue as to the answer. Perhaps half-life is related to the amount of "churn" in the nucleus? The more things are moving around, the more likely it is that something will collide just wrong and get ejected?

                              • (Score: 2) by RS3 on Friday January 08 2021, @02:05AM (1 child)

                                by RS3 (6367) on Friday January 08 2021, @02:05AM (#1096797)

                                Thank you for awesomeness!

                                >do neutrons ever collide with their own electrons when trying to escape?

                                What I meant was, when a neutron escapes from nucleus, how does it get through the electron clouds without ever colliding with one?

                                • (Score: 2) by Immerman on Friday January 08 2021, @03:14PM

                                  by Immerman (3985) on Friday January 08 2021, @03:14PM (#1096996)

                                  Gotcha. Hmm...I suppose it might - I don't really know QM well enough to say anything for sure. It's probably not very common though. To start with, a classic electron is far smaller than a nucleus, which is itself around 1/10,000th the size of the atom - even with a hundred classical "billiard ball" electrons orbiting the nucleus, the odds that any of them would be directly in the path of a neutron as it comes rocketing out of the nucleus are *extremely* low. And in reality electrons are more of a distributed wavefunction that interacts almost entirely via electric charge, which the neutron doesn't have. I believe neutrons do have spin, which might interact with an electron... but I think spin mostly factors in when two identical wavefunctions are trying to occupy exactly the same space.

      • (Score: 3, Informative) by Anonymous Coward on Wednesday January 06 2021, @08:49AM (4 children)

        by Anonymous Coward on Wednesday January 06 2021, @08:49AM (#1095546)

        The actual error rate is something like one bit every couple of decades or so. Which is why people's computers don't crash constantly. If you had 71 errors every day, your computer would crash regularly, often several times a day. Even if you aren't running Windows. The fact that this doesn't happen proves that that number is ridiculous. I have about one crash per year and I overclock. Lots of computers never crash. There are Linux systems out there with uptime above a decade.

        For real discussion on the subject (as opposed to a Wikipedia interpretation of a sensationalist journalist's misreading and hyping of a study that actually drew the opposite conclusion) see, for example, here [reddit.com] or Google's actual study here [toronto.edu]. What they found is that while there are large numbers of errors, they are concentrated in about 1-2% of the DIMMs. In other words, while a typical home user might get a bad stick of RAM and have to replace it, a bad stick of RAM in an ECC system turns into an ongoing stream of (mostly correctable) errors to the tune of thousands per day. What's more, the "bad" DIMMs are also highly correlated with machines, so there are a lot of these errors that are actually marginal motherboards or CPUs that end up getting corrected as well.

        From the actual study:

        8% of DIMMs in our fleet saw at least one correctable error per year

        And that's including the bad ones that a home user would simply replace.

        So Linus, as seems to be the norm these days, is just wrong. ECC is bad for home users because it's much slower. Like 30% slower. Datacenters use ECC because they have server CPUs with huge caches and eight memory channels that tolerate slow RAM, need all the uptime they can get, and can't afford to spend hours troubleshooting RAM problems. Home users aren't datacenters! Home users can afford to fiddle around swapping DIMMs!

        Rowhammer isn't relevant. Sure, the hardware is supposed to always correctly execute legal code, and Rowhammer code is legal code. So are side channel attacks, Meltdown and Spectre, and basically every security threat faced by modern computing. Programmers have finally gotten good at not writing buffer overflows using languages that aren't susceptible to buffer overflows, so the security researchers are getting creative. That's good! Security got better. You still have to do it. DDR4 mitigated Rowhammer, and DDR5 is supposed to mitigate it some more. It's only really a problem for DDR3... and ECC doesn't even prevent it!

        • (Score: 2) by Immerman on Wednesday January 06 2021, @02:32PM (3 children)

          by Immerman (3985) on Wednesday January 06 2021, @02:32PM (#1095619)

          >> 8% of DIMMs in our fleet saw at least one correctable error per year
          >And that's including the bad ones that a home user would simply replace.

          Oh? And how would the typical home user know that it's memory to blame? Unless your bit-flip error rate is causing at least several bit flips per hour, you have basically zero chance of spotting the problem with a RAM test.

          • (Score: 2) by barbara hudson on Wednesday January 06 2021, @03:44PM (2 children)

            by barbara hudson (6443) <barbara.Jane.hudson@icloud.com> on Wednesday January 06 2021, @03:44PM (#1095647) Journal

            Most errors are meaningless. If you've got 8 gig of RAM but are using only 1 gig, most of the errors on DRAM refresh will not do anything since when you go to use that RAM, the first thing you do is overwrite it or zero it out. Any previous flipped bits are irrelevant.

            Most of the code you load never gets exercised, so again, bit flips there are irrelevant. Even much of the data can be bit-flipped without changing anything. After all, you don't rewrite every row in a database when you make a change to one field. And who cares about a single pixel in a 50 meg image? You're going to have more artifacts when you save it in a compressed format anyway.

            And of course, if a bit is flipped in a file's memory image after you, the human, have finished reading it, and you don't save it because you, the human, didn't change anything and the software hasn't flagged the file as modified by you, who cares?

            Because most of the code you load never runs - executables have plenty of slack space in them, dead code, functions from libraries that will be run once in a blue moon, etc. Linus knows this. But I guess he's getting old and a bit irrelevant.

            --
            SoylentNews is social media. Says so right in the slogan. Soylentnews is people, not tech.
            • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @06:39PM

              by Anonymous Coward on Wednesday January 06 2021, @06:39PM (#1095706)

              Your damaged brain had its bits flipped right out! You can't even tell that you are really a man, not a deranged wannabe chemically castrated woman.

            • (Score: 2) by Immerman on Wednesday January 06 2021, @08:29PM

              by Immerman (3985) on Wednesday January 06 2021, @08:29PM (#1095773)

              Very true, which is why ECC has traditionally targeted business-critical machines, where most of the RAM is likely to be in use, and any data corruption can get very expensive.

              However, even for the rest of us there's no telling where that flipped bit will be. Going by the Google study I referenced elsewhere, with 16GB of RAM you'll average 71 flipped bits in the course of an 8-hour day. Even if only 1.4% of your RAM is holding information where a flipped bit will matter, you'll average one "important" flipped bit per day.
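The arithmetic behind that 1.4% figure, as a quick sketch. It assumes the quoted rate of 1 bit error per gigabyte per 1.8 hours, linear scaling, and (an added assumption, not from the study) that flips arrive independently, so a Poisson model gives the chance of at least one "important" flip in a day. Function names are made up for illustration.

```python
import math


def important_flips(ram_gb: float, hours: float, important_fraction: float,
                    hours_per_error_per_gb: float = 1.8) -> float:
    """Expected flips landing in the fraction of RAM where a flip matters."""
    total_flips = ram_gb * hours / hours_per_error_per_gb
    return total_flips * important_fraction


def prob_at_least_one(expected: float) -> float:
    """Poisson model: chance of one or more events given the expected count."""
    return 1.0 - math.exp(-expected)


mean = important_flips(ram_gb=16, hours=8, important_fraction=0.014)
print(f"expected important flips per day: {mean:.2f}")
print(f"chance of at least one on a given day: {prob_at_least_one(mean):.0%}")
```

So "one per day" is the average; under the independence assumption, roughly a third of days would see none and some would see several.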

              Now, how important is that really? Are you going to lose hours of work or clear out your bank account as a result? Probably not, probably it's just a nuisance. Still, RAM is a small part of the overall cost of a typical computer, and ECC RAM should only increase that cost by a small amount in order to virtually eliminate such nuisances.

              ECC also offers the advantage of letting you know immediately that your RAM is faulty, without running a time-consuming memory test that may well not detect the error. An intermittent error can be the most aggravating to identify, and buying new RAM just to see if that fixes the problem is an expensive option - assuming your computer even has replaceable RAM. Having the errors be silently repaired and (potentially) logged means you don't have the aggravation of memory errors unless the RAM is *very* faulty, and if you do have faulty RAM it will be very obvious. You can even track error rates over time to see if the problem is worsening, or if swapping stick positions resolves interference that was flipping bits in a vulnerable stick.
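As one illustration of the logging side: on Linux, the kernel's EDAC subsystem exposes per-memory-controller counters under sysfs. The sketch below reads the corrected (`ce_count`) and uncorrected (`ue_count`) totals. It assumes an EDAC driver is loaded for your hardware; on machines without ECC, or without a matching driver, the directory is typically absent and the function just returns an empty dict. The function name is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(root_path.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable for this controller
        counts[mc.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: {ce} corrected, {ue} uncorrected")
```

Running this from a cron job and recording the numbers is one simple way to see whether a stick is getting worse over time.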

              And then there's the fact that not all bit flips are normal - RowHammer and related memory attack vectors work by overwhelming memory with atypical usage patterns designed to flip bits in adjacent memory locations the attacking software doesn't have direct access to. ECC could also go a long way to protecting from such attacks.

    • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @06:51AM

      by Anonymous Coward on Wednesday January 06 2021, @06:51AM (#1095527)

      Change my mind.

      Different kind of Marxist here. Find me a healthy free market first.

    • (Score: 4, Informative) by RamiK on Wednesday January 06 2021, @09:32AM (6 children)

      by RamiK (1813) on Wednesday January 06 2021, @09:32AM (#1095553)

      The merits of ECC in general aren't the point. It was always useful and outright essential for productivity loads. It's why workstations and servers paid a premium for it. But while it was annoying, it was justified since ECC memory involved increased costs across the design for the motherboard (the memory controller hub on the northbridge), CPU and RAM.

      But, things changed.

      Around 2011 the northbridge was assimilated into the CPU so there's no longer additional design and validation costs for ECC on the motherboard so long as it's the default. That left the CPU and memory.

      Then a couple of years ago AMD designed their memory controllers with ECC support built-in for every model proving it doesn't really cost anything extra to get it done on the CPU / memory hub side of things too.

      Finally, and that's where the relevant part of the rant comes in, the most recent memory production nodes ended up so noisy that memory manufacturers are being forced to use ECC internally on their controllers anyhow. They even put it into their standard specs. So, what's happening now is that all the chips (pardon the pun) are in place and there's nothing BoM-wise preventing mass-market ECC adoption. That is, except for Intel's market segmentation...

      So, with AMD in the game, it's finally a fight worth fighting over for Linus. But that's only been true for the last couple of years really.

      --
      compiling...
      • (Score: 1) by shrewdsheep on Wednesday January 06 2021, @03:18PM (5 children)

        by shrewdsheep (5215) on Wednesday January 06 2021, @03:18PM (#1095637)

        If there is transparent ECC chekcing (;-) going on already, why does it have to be exposed explicitly? It would, at best, additionally check some part of the bus-transfer. I believe this would be a fine solution. The RAM controller can always expose error rates on a side channel. Could you reference some RAM modules with transparent, internal, ECC?

        • (Score: 3, Informative) by RS3 on Wednesday January 06 2021, @03:59PM

          by RS3 (6367) on Wednesday January 06 2021, @03:59PM (#1095658)

          Here are some chips: https://www.intelligentmemory.com/ECC-DRAM/DDR3/ [intelligentmemory.com]

          I'll research it some more for you and find out who uses them in DIMM modules.

          The next step will be to find out if ECC correction info / stats are available on the i2c bus.

        • (Score: 4, Interesting) by RamiK on Wednesday January 06 2021, @09:23PM (3 children)

          by RamiK (1813) on Wednesday January 06 2021, @09:23PM (#1095813)

          Could you reference some RAM modules with transparent, internal, ECC?

          To my knowledge, all DDR5 DIMMs will have on-die ECC error correction.

          why does it have to be exposed explicitly?

          It's discussed here:

          We show that if On-Die ECC is not exposed to the memory system, having a 9-chip ECC-DIMM (implementing SECDED) provides almost no reliability benefits compared to an 8-chip non-ECC DIMM. We also show that if the error detection of On-Die ECC can be exposed to the memory controller, then Chipkill-level reliability can be achieved even with a 9-chip ECC-DIMM.

          ( https://ieeexplore.ieee.org/document/7551405 [ieee.org] )

          --
          compiling...
          • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @10:03PM (2 children)

            by Anonymous Coward on Wednesday January 06 2021, @10:03PM (#1095847)

            Where is the dumbed down version? Car analogy?

            • (Score: 0) by Anonymous Coward on Friday January 08 2021, @02:01AM (1 child)

              by Anonymous Coward on Friday January 08 2021, @02:01AM (#1096796)

              To put some background and explain it in English because I couldn't come up with a good analogy.

              ECC is usually done as SECDED (single error correct, double error detect). This means that all errors that are single bits can be corrected and all errors that are two bits can be detected but not necessarily fixed. Chipkill is a higher standard that is SECDED plus the additional guarantee that all errors introduced by a single memory chip regardless of size are correctable and result in a detectable error in the face of any other "single" errors on other chips. Basically, it makes it so that all errors on a single chip can be treated as a single bit error regardless of how big it actually is. Additionally, most errors that occur are caused by bad hardware (82-92%) and not random errors. As a result, most errors are confined to a single chip.

              Now, under a regular ECC system dealing with purely random errors, whether or not the user actually sees the errors wouldn't make a difference. They pop up, are fixed, and there is nothing you can do about it anyway precisely because they are random. This is not the case with bad hardware. By not exposing errors to the OS and user that are detected, the system is unable to reach the same ability to correct the situation of a bad chip as the Chipkill system. Therefore, you can only get the benefit of fixing random errors, which are the minority as mentioned. However, if you do signal the OS about the errors, they are able to replace bad chips. They therefore get the majority of the benefit of Chipkill (but not 100% because you still get errors until replacement occurs which is not the case with Chipkill) without the performance drawbacks.
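For the curious, SECDED is small enough to demonstrate on a toy scale. The sketch below is an extended Hamming (8,4) code: 4 data bits, 3 Hamming parity bits, and 1 overall parity bit. It is only an illustration of the principle. Real ECC DIMMs typically protect 64 data bits with 8 check bits, and the exact code a given memory controller uses is not something this sketch claims to reproduce.

```python
def encode(data: int) -> list:
    """Encode a 4-bit value as 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    d = [(data >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1s even
    return code


def decode(code: list) -> tuple:
    """Return (data, status); data is None when the error cannot be corrected."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of set positions is 0 for a valid word
    odd_parity = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        fixed[syndrome] ^= 1                # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"        # even parity but nonzero syndrome: two flips
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return data, status


word = encode(0b1011)
word[6] ^= 1                                # one flipped bit: corrected silently
print(decode(word))
word[2] ^= 1                                # a second flipped bit: detected, not fixable
print(decode(word))
```

The "uncorrectable" branch is the part that matters for the argument above: the hardware knows something went wrong, and whether that knowledge reaches the OS is a separate design decision.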

              • (Score: 0) by Anonymous Coward on Friday January 08 2021, @02:09AM

                by Anonymous Coward on Friday January 08 2021, @02:09AM (#1096800)

                And of course I screwed it up by not proofreading. The OS can replace the bad DATA not chips while the user can replace bad chips.

    • (Score: -1, Offtopic) by Anonymous Coward on Wednesday January 06 2021, @10:54AM

      by Anonymous Coward on Wednesday January 06 2021, @10:54AM (#1095566)

      Or, perhaps, you could regulate your RAM, according to the Second Amendment, Arik? With barrel regulation, or ECC under Federal Militia Rules?

    • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @03:11PM

      by Anonymous Coward on Wednesday January 06 2021, @03:11PM (#1095633)

      free markets are just as real as rainbow farting unicorns.

    • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @10:10PM

      by Anonymous Coward on Wednesday January 06 2021, @10:10PM (#1095852)

      I've been ranting about this for more than 2 decades.

      Hm..., something comes to mind.

      https://xkcd.com/725/ [xkcd.com]

  • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @04:13AM (6 children)

    by Anonymous Coward on Wednesday January 06 2021, @04:13AM (#1095458)

    In software, "shit happens" has been the accepted notion. We sent out update when we were forced to.

    Hardware used to be different. But that was like half a century ago. Hardware is as flaky as software, and that's been the case for a few decades now.

    • (Score: 3, Insightful) by canopic jug on Wednesday January 06 2021, @05:48AM (5 children)

      by canopic jug (3949) Subscriber Badge on Wednesday January 06 2021, @05:48AM (#1095502) Journal

      In ~~software~~ Microsoft products, "shit happens" has been the accepted notion. We sent out update when we were forced to.

      Hardware used to be different. But that was like half a century ago. Hardware is as flaky as software, and that's been the case for a few decades now.

      There. Fixed that for you. The problem has not been software but rather Microsoft products. Over time that has translated to a general expectation of bad engineering in software and to an acceptance of bad design everywhere else. With both computer software and hardware that has become an expectation. Everywhere else, it has merely become accepted, much to the threat of our continued survival as a society. Bill Gates' most lasting legacy, if there is a civilization left after a few years, will be that he made bad engineering acceptable.

      --
      Money is not free speech. Elections should not be auctions.
      • (Score: 2) by Immerman on Wednesday January 06 2021, @06:17AM (4 children)

        by Immerman (3985) on Wednesday January 06 2021, @06:17AM (#1095510)

        Really? Microsoft is particularly bad, but I can't say that I've ever used flawless software. And I don't think it's just a problem with expectations - software engineering is HARD. A typical car only has around 1800 parts, and cars always have their faults, because engineers aren't perfect.

        If you take a single line of code as very roughly comparable in design complexity to a single car part (ranging from a single bolt out to a complex cast manifold), that means a single large scale piece of software can have around 1,000-5,000x as many "parts" as a typical car, and in any sane world you would expect at *least* a similar increase in flaws.

        • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @06:22AM (2 children)

          by Anonymous Coward on Wednesday January 06 2021, @06:22AM (#1095513)

          > Microsoft is particularly bad, but I can't say that I've ever used flawless software.

          On Linux they call this User Error or WONTFIX.

          • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @01:50PM (1 child)

            by Anonymous Coward on Wednesday January 06 2021, @01:50PM (#1095606)

            On Linux they call this User Error or WONTFIX.

            On Linux, they call it "unreproducible". More seriously, with free software, the user can actually fix the damn problem too.

            • (Score: 2) by barbara hudson on Wednesday January 06 2021, @03:54PM

              by barbara hudson (6443) <barbara.Jane.hudson@icloud.com> on Wednesday January 06 2021, @03:54PM (#1095655) Journal

              The user can fix a flipped bit on Linux? Don't think so.

              When you get weird errors, and you swap the RAM, and it goes away, problem solved.

              There have even been cases where one stick will be more susceptible to EM interference from the power supply, and swapping slots so the vulnerable stick is now furthest from the power supply fixes the problem.

              --
              SoylentNews is social media. Says so right in the slogan. Soylentnews is people, not tech.
        • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @03:46PM

          by Anonymous Coward on Wednesday January 06 2021, @03:46PM (#1095650)

          Yep, ACPI was a particular mess because MS refused to enforce Intel's rules about how the DSDT was written. You'd have systems shipping with DSDT so badly written that they would not compile at all on the reference compiler that companies were supposed to be using, but MS would include special code to tolerate it rather than just allowing the system to have working ACPI support. I remember having to dump and recompile it myself because the manufacturer was too lazy to properly code it. The thing is that in that case, it wouldn't have even cost them any development time as the core logic was correct, there were just a few minor things they chose not to do correctly. No programming knowledge necessary.

          Going back to the bad old Wintel days, MS has used manufacturer laziness to make it hard for any other OS to compete with them. So, this isn't exactly new news or particularly shocking for those paying attention. Not only were those Wintel modems less functional, but you had to pay a premium for having some of the work offloaded onto the CPU.

  • (Score: 3, Informative) by dltaylor on Wednesday January 06 2021, @04:55AM (5 children)

    by dltaylor (4693) on Wednesday January 06 2021, @04:55AM (#1095481)

    I have been running ECC since "like, forever". It has caught a few glitches over the years. My soon to be replaced (I hope) desktops are old Dell servers, with much (PSU, disks, graphics) replaced. They have Xeons (X5570) and ECC memory, and everything that I do where bit rot might matter is done on them.

    I also have a couple of old Dell laptops with i7s, which I mostly use for web browsing and checking email on the road.

    A couple of Ryzens in motherboards supporting ECC are high on the list as a winter purchase. I do not game (or mine bitcoin), so I can re-use the current desktops' graphics. I still have the original cards, so I can put the servers back to their "intended" use.

    • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @05:56AM

      by Anonymous Coward on Wednesday January 06 2021, @05:56AM (#1095506)

      Beware of Arik- your post, which is pretty similar to mine, will set him off. Otherwise thanks for your excellent post.

    • (Score: 1, Funny) by Anonymous Coward on Wednesday January 06 2021, @06:24AM (2 children)

      by Anonymous Coward on Wednesday January 06 2021, @06:24AM (#1095515)

      I saved a fortune on ECC RAM by wrapping the sticks in tin foil. No Errs so far.

      • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @11:56AM (1 child)

        by Anonymous Coward on Wednesday January 06 2021, @11:56AM (#1095574)

        How about lead as a heat sink and shield? How much does it take to shield gammas et al? Maybe Carbon matte under it? Hell I don't know. It's got to be easier than all that ECC circuitry and code.
        Seems the KISS principle is best here. (Keep It Simple Stupid, for those that don't know KISS)

        • (Score: 2) by DECbot on Wednesday January 06 2021, @08:43PM

          by DECbot (832) on Wednesday January 06 2021, @08:43PM (#1095785) Journal

          And here I thought KISS meant "what would Gene Simmons do?"

          --
          cats~$ sudo chown -R us /home/base
    • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @11:04AM

      by Anonymous Coward on Wednesday January 06 2021, @11:04AM (#1095569)

      When you say 'glitches' do you mean hard errors or soft ones? Soft errors are fairly common but ECC automatically and silently corrects them. If you aren't watching your logs you won't notice them at all. Hard errors are the ones that ECC can't fix and result in program termination or forced reboots. (On non-ECC systems all memory errors are hard but usually go undetected, resulting in random corruption.)

  • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @06:51AM (6 children)

    by Anonymous Coward on Wednesday January 06 2021, @06:51AM (#1095526)

    CAN SOMEONE SAY WHAT ecc ram IS OR WAS - THANKS VERY MUCH

    • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @07:10AM

      by Anonymous Coward on Wednesday January 06 2021, @07:10AM (#1095530)

      Error-correcting code memory

    • (Score: 2) by dltaylor on Wednesday January 06 2021, @09:53AM (3 children)

      by dltaylor (4693) on Wednesday January 06 2021, @09:53AM (#1095557)

      more completely: ECC is a TLA for Error Checking and Correcting

      There are extra bits in the data stream to/from the memory controller, which may be inside the CPU, as in Xeons, to the RAM. The extra bits allow for a code to be stored to the memory, and read back when the memory is accessed that can identify that the data read back is wrong. "Normally" these days (some specialized computers can do more) it allows for any single bit error to be identified, and corrected from the code, and some double bit errors. Back in the days of parity memory, all you knew was that the parity was bad, for single bit errors, but not which bit, so you couldn't fix it, and two flipped bits may have good parity for bad data.
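      To make the parity versus ECC difference concrete, here is a minimal sketch using a toy Hamming(7,4) code plus one overall parity bit. Real ECC DIMMs typically protect 64 data bits with 8 check bits; the 4-bit size and the names encode, decode and even_parity are just assumptions for illustration, not any memory controller's actual logic.

```python
def even_parity(bits):
    """Plain parity: 1 if the number of set bits is odd, else 0."""
    return sum(bits) % 2


def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of 0/1 values).

    Index 0 is the overall parity bit; indexes 1, 2 and 4 are Hamming
    check bits; indexes 3, 5, 6 and 7 carry the data.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = even_parity(code[1:])         # makes the whole word even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall_bad = even_parity(code) == 1

    if syndrome == 0 and not overall_bad:
        status = "ok"
    elif overall_bad:
        # Exactly one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Overall parity looks fine but the check bits disagree: two flips.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

      Flip any one bit of encode(n) and decode still returns n; flip any two and it reports the word as uncorrectable instead of handing back bad data. With even_parity alone, two flips cancel out and go unnoticed.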

      • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @10:58AM (2 children)

        by Anonymous Coward on Wednesday January 06 2021, @10:58AM (#1095567)

        OK, what is TLA? Is this some kind of military code to keep the civilians from knowing what they have planned for us? Like MIRV and MAD? And SNAFU, FUBAR, and BOHICA? So funny with their acronyms, the militaries are! Until you have been all you can be. That kinda sucks.

        • (Score: 2) by RS3 on Wednesday January 06 2021, @07:17PM

          by RS3 (6367) on Wednesday January 06 2021, @07:17PM (#1095726)

          TLA = Three Letter Acronym.

          Acronyms are a bit overused, IMHO.

        • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @07:20PM

          by Anonymous Coward on Wednesday January 06 2021, @07:20PM (#1095728)

          TLA is a TLA for Three Letter Acronym.

    • (Score: 3, Funny) by Anonymous Coward on Wednesday January 06 2021, @01:53PM

      by Anonymous Coward on Wednesday January 06 2021, @01:53PM (#1095607)

      CAN SOMEONE SAY WHAT ecc ram IS OR WAS - THANKS VERY MUCH

      It's the one that disables your CAPS LOCK ;)

  • (Score: 2) by bradley13 on Wednesday January 06 2021, @08:47AM (1 child)

    by bradley13 (3053) on Wednesday January 06 2021, @08:47AM (#1095545) Homepage Journal

    Since Intel has failed to support ECC for consumer machines, it's hard to justify buying ECC for personal use. The workstations that I have seen supporting ECC have been really poorly engineered. I remember a couple of Dell Precisions that sounded like jets waiting for takeoff - not something you really want sitting next to your desk. So I'm not sure just Intel is to blame - the whole PC manufacturing market has played along.

    But bit-rot definitely happens. One really obvious example: I once saw a presentation that included a live Excel sheet - and Excel had summed up a column of numbers incorrectly. When I pointed it out, at first no one believed me (computers don't make mistakes!). I persuaded the presenter to reload the sheet, and suddenly the total was different. It does happen, and can have genuine consequences. The performance hit is minor, if you are doing anything of actual importance.

    --
    Everyone is somebody else's weirdo.
    • (Score: -1, Troll) by Anonymous Coward on Wednesday January 06 2021, @09:04AM

      by Anonymous Coward on Wednesday January 06 2021, @09:04AM (#1095550)

      Maybe he had ECC and it caused the problem? You know like vaccines caused AIDS.

  • (Score: 3, Interesting) by Rich on Wednesday January 06 2021, @10:16AM (6 children)

    by Rich (945) on Wednesday January 06 2021, @10:16AM (#1095558) Journal

    Wow. That's the old Linus. "...these f*ckers happily sold broken hardware...". Yay for him! Did he forget his medications? Will he be cancelled now???

    Anyway, because most RAM sits idle most of the time in a "desktop" setting, I think it is a good idea to run soft ECC. Pages that sit idle for a few seconds will be checksummed and locked. On the next access, or at slow periodic intervals, they are checksummed again before being made accessible. While that will not be perfectly reliable, it will give a good idea of how likely a bit error is on a given machine. Being busy or sitting idle does not have any effect on the reliability of DRAM, save for Rowhammer-like events. It's mostly about where the random cosmic ray hits. I assume at least. And if it is not, statistics would indicate that, so all the better to have them. I think the kernel actually has such features, but it would be up to the distros to package them in an accessible way.
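    A rough user-space sketch of the checksum-and-recheck idea, assuming 4 KiB pages and CRC32 as the checksum. The SoftEccScrubber class and its method names are made up for illustration; this models the bookkeeping only and is not how any kernel feature works.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes


class SoftEccScrubber:
    """Toy model: checksum pages when they go idle, re-verify them later."""

    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.sealed = {}  # page index -> CRC32 recorded when the page went idle

    def write(self, idx, offset, data):
        """A legitimate write; the old checksum no longer applies."""
        self.sealed.pop(idx, None)
        self.pages[idx][offset:offset + len(data)] = data

    def seal(self, idx):
        """Record a checksum for a page that has gone idle."""
        self.sealed[idx] = zlib.crc32(self.pages[idx])

    def verify(self, idx):
        """True if the page is unsealed or still matches its checksum."""
        expected = self.sealed.get(idx)
        return expected is None or zlib.crc32(self.pages[idx]) == expected

    def scrub(self):
        """Slow background pass: return the sealed pages that no longer match."""
        return [i for i in sorted(self.sealed) if not self.verify(i)]
```

    Sealing a page, flipping a bit behind the scrubber's back (s.pages[2][10] ^= 0x01) and calling s.scrub() reports page 2. This only detects the error; correcting it would need extra redundancy stored per page.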

    • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @07:23PM (2 children)

      by Anonymous Coward on Wednesday January 06 2021, @07:23PM (#1095731)

      I have no problems calling out people for stuff they have done.

      Especially if it is harmful, and doubly so if it is simply to make a profit.

      • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @07:39PM (1 child)

        by Anonymous Coward on Wednesday January 06 2021, @07:39PM (#1095741)

        Thank you Linus, for gracing our humble discussion. :)

        • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @07:51PM

          by Anonymous Coward on Wednesday January 06 2021, @07:51PM (#1095749)

          I <3 Linux Torvalds.

    • (Score: 2) by RS3 on Wednesday January 06 2021, @07:37PM (1 child)

      by RS3 (6367) on Wednesday January 06 2021, @07:37PM (#1095740)

      Re Linus: he's been ranting all along; he has toned down the direct attacks on individuals.

      I like your "soft ECC" idea. Should be pretty easy to implement something to do just that.

      Along those lines, thinking about "rowhammer" problem, what if RAM was refreshed more often? Should thwart rowhammer and other RAM weaknesses. Might even reduce susceptibility to bit-flips (because it would "catch" the bit state sooner, before more charge has leaked away.)

      For anyone who doesn't know but is interested, the main RAM in most computers is "dynamic RAM", a name which, like too many technical terms, doesn't convey much. The other type is "static RAM", which is essentially "flip-flops": transistor pairs that store 1 bit per pair (really more like 4-6 transistors per bit) and maintain the bit state as long as power is applied. But each bit "cell" in static RAM is much bigger than one in dynamic RAM, so to get the capacity without the physical size, they came up with dynamic RAM.

      Dynamic RAM is an array of capacitors, each of which holds an electric charge to represent a 1 or 0 state. But capacitor charge "leaks", so dynamic RAM must be "refreshed" continuously: basically, read the data and write the same data back. Here's a good writeup: https://en.wikipedia.org/wiki/Memory_refresh [wikipedia.org]
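      A toy model of why the refresh interval matters, assuming the cell's charge leaks away exponentially and a stored 1 is lost once the charge drops below half. The time constants and the threshold are made-up illustration values, not figures from any datasheet; 64 ms is used only as a commonly cited refresh window.

```python
import math


def charge_after(t_ms, tau_ms):
    """Fraction of the original charge left after t_ms of leakage."""
    return math.exp(-t_ms / tau_ms)


def survives(refresh_interval_ms, tau_ms, threshold=0.5):
    """True if a stored 1 still reads as 1 just before the next refresh."""
    return charge_after(refresh_interval_ms, tau_ms) > threshold


def max_refresh_interval(tau_ms, threshold=0.5):
    """Longest interval this model tolerates before the bit is lost."""
    return tau_ms * math.log(1 / threshold)
```

      In this model a healthy cell (say tau_ms=200) comfortably survives a 64 ms interval, while a leaky one (tau_ms=50) does not. Refreshing more often widens the margin for weak cells, at the cost of more time the RAM spends refreshing instead of serving accesses.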

      • (Score: 3, Interesting) by Rich on Wednesday January 06 2021, @08:43PM

        by Rich (945) on Wednesday January 06 2021, @08:43PM (#1095784) Journal

        1.) Referring to Intel as "these f*ckers" is pretty classic, I'd say. But don't get me wrong, everyone: without old Linus, Linux is eventually destined to die by committee.

        2.) You can check your hypothesis with an experiment: set up a rowhammer test, see if it works on a specific row, and then repeat the experiment with an access to the row being hammered right before hammering. Any row access will cause a refresh (or it did, in the old days; they could have introduced a cache somewhere past DDR3 or so). So the hammer will hit the row in freshly refreshed state.

        3.) The Soft-ECC could run with different performance settings. Most aggressive would work as described and flush out modified pages very fast, maybe even with a worst case limit for the number of pages (say 1% for at least 99% safety) to be "hot". This could also keep correction codes. Most lenient would flush much slower, time based, and slowly idle through the background, with a hint to check soon if any page was read from. That would be good enough for at least "you just got garbage, but we can't do anything about it" without significant performance penalty. A balance between those might be a memory bandwidth throttle (say a 50 out of 5000 MB/s limit). It would be in the "most aggressive" mode when the computer waits for a key to be typed, as it does 99% of the time, and go out of the way when there's really computing to be done. I'd be curious what results this would have in a typical desktop setting. It could also be coupled with a "RAM Doubler" style compressing VM to help the 4 GB SBC class just a tiny bit with everyday tasks. Not that I'd actively encourage that, usually you hear me quote Seymour Cray on that topic ("Virtual memory results in virtual performance").

    • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @08:52PM

      by Anonymous Coward on Wednesday January 06 2021, @08:52PM (#1095794)

      it's ok because they are not diversity hires/outreachy interns.

  • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @12:57PM (1 child)

    by Anonymous Coward on Wednesday January 06 2021, @12:57PM (#1095592)

    Google did some statistics on the ECC usage in their servers years ago already. They found on average ~4k corrected memory corruptions per year, per DIMM. Most of those won't impact the functioning of the system but any system that is running for a longer time may get affected at some point.

    More details here: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf [toronto.edu]

    BTW, another worrying example of a memory corruption effect, back in 2003 already: https://www.vice.com/en_us/article/9agbxd/space-weather-cosmic-rays-voting-aaas [vice.com] (probably of interest to some US people)

    • (Score: 0) by Anonymous Coward on Wednesday January 06 2021, @02:50PM

      by Anonymous Coward on Wednesday January 06 2021, @02:50PM (#1095625)

      It is partly about odds, but that is not the whole story.

      There is an argument that says the odds of a random bit flip are pretty low. In fact, so low that ECC is not worth its cost.

      BUT rowhammer is not a random thing, making the odds misleading. Running without ECC (or at least parity?) opens a useful attack surface.

      This area seems an old subject dating to the "parity is for farmers" story for the 6600/7600.

      From https://en.wikipedia.org/wiki/ECC_memory [wikipedia.org]

      Seymour Cray famously said "parity is for farmers" when asked why he left this out of the CDC 6600. Later, he included parity in the CDC 7600, which caused pundits to remark that "apparently a lot of farmers buy computers". The original IBM PC and all PCs until the early 1990s used parity checking. Later ones mostly did not.

         
