For most people, hardware problems and slow deliveries are annoying. But if you're the person behind the operating system that underpins much of the cloud, Android and IoT, your problems could easily become a big issue for lots of other people too.
Linux creator Linus Torvalds told a kernel contributor on Sunday that he's doing merges "very slowly" from one of his laptops as he waits for "new ECC memory DIMMS to arrive".
[...] "It was literally a DIMM going bad in my machine randomly after 2.5 years of it being perfectly stable. Go figure. Verified first by booting an old kernel, and then with memtest86+ overnight," he explains in a Linux kernel developer mailing list spotted by The Register.
[...] In early 2020, during the first wave of pandemic restrictions, Torvalds switched his main 'frankenbox' PC from one with an i9-9900k to one equipped with a monster 32-core AMD Threadripper 3970x-based processor. It was, as he said then, the first time in 15 years that his desktop wasn't Intel-based. As a consequence of moving off Intel, his 'allmodconfig' test builds accelerated by a factor of three.
[...] Torvalds last year took a swipe at Intel for its ECC memory policies. "Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt [with regards to] ECC. Seriously," he wrote.
Torvalds has also been using an Apple M1 silicon laptop for some development work, thanks to the Asahi Linux project, which has been working on bringing the Arch Linux distro to Apple's M1 architecture.
(Score: 4, Interesting) by JoeMerchant on Wednesday October 12 2022, @05:53PM (4 children)
With a "frankenbox" equipped with "monster processors" you would expect some bleeding edge issues.
Back in the day, shortly before aux power connections to graphics cards became a thing, I bought the "recommended" CPU/GPU combo for Autodesk Inventor - a combo that apparently they had never tested much because: certain operations in Inventor (like rotation of a 3D view) would reliably cause the PC to hard-reset, losing everything since the last save operation. These certain operations were 100% reliable in causing the power flicker to the CPU - apparently that "monster" GPU put a load spike on the bus which took the system down. Thankfully, those operations only came up once every day or so of working - and when they did there were usually workarounds I could find that didn't spike the bus, but "save before rotate" became a regular operation of mine.
Ukraine is still not part of Russia. Glory to Ukraine 🌻 https://news.stanford.edu/2023/02/17/will-russia-ukraine-war-end
(Score: 2) by vux984 on Wednesday October 12 2022, @07:58PM (3 children)
Could have been a PSU issue, or a combination with other peripherals too. Conceivably a thermal limit was being tripped too. Or a hardware fault like a bad capacitor on the motherboard...?? Tweaking BIOS settings to lower voltages or frequencies might also have worked around it, albeit at some performance cost.
"Thankfully, those operations only came up once every day or so of working"
Heh, the idea that you had to put up with it, and alter your workflow to work around instead of getting it resolved one way or another, is kind of mind blowing. Surely your time it wasted was worth more.
I bought some RAM once that didn't seem to agree with my mainboard... it met the specs, and when it was installed and running it passed all the memory self-tests and memtest etc, but the PC wouldn't boot reliably... froze up during boot up half the time. Hit reset when it did, usually once, sometimes two or three times, occasionally had to unplug it, but it always eventually worked. No issues once it was running, but it just didn't like booting up. Only affected me during reboots, so once a day or even less, like you, but even that was enough to drive me to replace it out of frustration.
(Score: 2) by JoeMerchant on Wednesday October 12 2022, @08:34PM (2 children)
>Conceivably a thermal limit was being tripped too
Not realistically. Normal crash I'd reload, go back to the same point and crash again after the exact same operation. Could also say "f-it" after a crash and go home, then reproduce the crash the next morning (after being powered off all night.)
>a hardware fault like a bad capacitor on the motherboard
If so, Inventor was the only software that ever tripped the problem.
>Tweaking BIOS settings to lower voltages or frequencies might also have worked around it
This was 2002ish, I forget if the recommended PC was a Gateway or what... whatever, it was one of those "mainstream" system packagers, with a top of the line Matrox GPU, IIRC. Mainstream BIOS in those days generally didn't get into overclockers' tweaking tools.
>Surely your time it wasted was worth more.
There's time wasted in the work-around, which -after practice- came down to about 3 minutes per 8 hours of heavy CAD work, and then there's time wasted in pursuing the fix, which can amount to a lost day or two if you're playing with motherboard replacements, etc. So, after about 3 months of heavy CAD work, say 3 hours of time wasted in work-around, the CAD work slacked off and I had other things to attend to... all in all not worth pursuing a fix when, in a year or two's time, a PC replacement/upgrade was likely to happen anyway.
Of course it was a big mystery at first, but it was a remarkably repeatable problem, and the Matrox slamming the system bus with a current demand spike is just about the only thing that made sense- made all the more sense when the next generation of graphics cards added power cable connections direct on the cards (requiring special for the time power supplies with those GPU supply cables included...)
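As a rough check on the workaround-versus-fix arithmetic above, here is a minimal Python sketch. The 20 working days per month and the 8-hour day are assumptions for illustration, not figures from the post, and the function names are made up.

```python
def workaround_hours(minutes_per_day: float, work_days: int) -> float:
    """Total hours lost to the workaround over a number of working days."""
    return minutes_per_day * work_days / 60.0


def break_even_days(fix_hours: float, minutes_per_day: float) -> float:
    """Working days after which a one-off fix costs less than the workaround."""
    return fix_hours * 60.0 / minutes_per_day


# Assumption: about 20 working days per month, so 3 months is about 60 days.
WORK_DAYS = 3 * 20

if __name__ == "__main__":
    # 3 minutes per day of heavy CAD work, as stated in the post.
    print("workaround cost (hours):", workaround_hours(3, WORK_DAYS))
    # "A lost day or two" chasing the fix, assuming 8-hour days.
    print("break-even, 1-day fix (days):", break_even_days(8, 3))
    print("break-even, 2-day fix (days):", break_even_days(16, 3))
```

Under those assumptions the workaround costs about 3 hours over 3 months, while even a one-day fix would need roughly 160 working days of heavy CAD use to pay for itself.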
(Score: 3, Interesting) by vux984 on Wednesday October 12 2022, @10:35PM (1 child)
Oh, I suspect your diagnosis was essentially correct; that the extreme power draw of possibly both the CPU and GPU during a 3D rotation is the root cause. I'm just speculating that the extreme power draw wasn't actually out of spec or inherently the problem, and wouldn't have been an issue ... except that perhaps you had a dodgy power supply that got a little flakey when that spike hit and didn't deliver, or a dodgy cap on the motherboard that likewise only caused an issue during the spike, or a dodgy matrox card, or something. ie... If you'd gotten even a replacement workstation with the same specs it might have just been fine with the load?!
As for the thermal limit, I had a build once that ran fine all day and night, while working, while gaming, no problems ever, but it would fail almost immediately if i ran the prime95 test - dead in under a second. Ultimately resolved it by reseating the heatsink and replacing the thermal paste. It never had any issue except in prime95, but in hindsight, that was simply because nothing else i ran, or did, even games, at the time actually could max out all 4 cpu cores on that PC like prime95 did. Likewise your matrox situation -- it's possible, even likely that nothing else you did hit the system quite that hard. The fact that it booted up immediately after the incident, and only happened during the one operation doesn't really scream thermal event to me either to be honest, but if your theory is a huge spike in power draw caused it, that will also cause a huge spike in thermals... that power is ultimately just being converted to heat after all, and if its just a spike in power/heat that's causing problems, it doesn't take long for it to drop back down to within limits at all, if normal operation is well below the thermal limits. In my case the PC also booted up just fine after a prime95 meltdown.
"So, after about 3 months of heavy CAD work, say 3 hours of time wasted in work-around, the CAD work slacked off and I had other things to attend to."
Heh, i love that you did the math on this. :) For myself, I know the impact of the mere "distraction of knowing something was wrong" would have added up to enough time that it would likely have been worth fixing, but that's me. :)
(Score: 2) by JoeMerchant on Thursday October 13 2022, @12:54AM
CAD work came in spurts, and a new PC every two years was the norm back then. If it happened every hour I would have done something about it, but averaging 5x per week... Nah.
About thermal paste: I had a 2006 MacBook Pro that was in the (rather large) series of lots where they didn't apply the thermal paste to the GPU properly. When it would fault, the display would go black but everything else would continue to work. When new it ran hours before faulting, but as time went by that time would drop by 50% about every 6 months, by the time it was 4 years old it would never run for an hour without special cooling effort. Eventually it ran just long enough to start a DVD to disk backup operation, which would finish flawlessly some hours later, long after the screen went black. Eventually the new laptop was so much faster that the MacBook wasn't worth any effort at all... Come to think of it, I still need to dispose of that shiny aluminum piece of junk.
(Score: 2) by wisnoskij on Wednesday October 12 2022, @07:32PM (11 children)
I thought ECC memory was rare and only used in the most specialized and mission critical of applications. You can just order ECC memory for a laptop?
(Score: 4, Interesting) by vux984 on Wednesday October 12 2022, @08:12PM (4 children)
ECC SODIMMs actually are available, and so are laptops that hold Xeon processors, and many recent AMD cpu's support ECC memory too, so it's probable, (although i don't know offhand) that some "workstation" laptops out there do properly support ECC memory.
In this case, honestly, I'd expect he's using non-ECC ram in the laptop while his desktop is down. Given he used non-ECC in his intel build desktops for decades prior it's not like he won't do a build without ECC. And he had the option of getting xeons and chose not to all those years because the price differential far exceeded the benefit even he thought ECC would bring.
So he just, rightfully, thinks ECC RAM is better, prefers to use it whenever he reasonably can, and thinks it should be more available in consumer hardware, and thanks to AMD lately, it actually is starting to become available to consumers, but really, in this market intel still dominates, so ECC won't be mainstream until intel releases its stranglehold on it or until amd eclipses intel.
(Score: 2, Insightful) by liquibyte on Wednesday October 12 2022, @08:17PM (3 children)
Since ECC exists, all memory should be ECC by default, not the other way around.
(Score: 3, Insightful) by JoeMerchant on Wednesday October 12 2022, @08:38PM (2 children)
Should be is often the opposite of is be. ECC costs extra, yes it's just pennies but when you sell millions of copies those pennies add up, and more importantly: when those pennies differential reach consumers it's a 5% or more price differential, and that's enough to sway _most_ buyers to go for the cheaper alternative, and once _most_ buyers are doing the cheaper thing, the price gap widens even more because the lower volume option carries more overhead, less competition, etc.
(Score: 1) by liquibyte on Friday October 14 2022, @06:11PM (1 child)
Those self same people will also justify the need for a thousand dollar phone to look at tik tok videos. My comment stands, all memory should by default be error checking regardless of the price differential.
(Score: 2) by JoeMerchant on Friday October 14 2022, @08:02PM
>Those self same people will also justify the need for a thousand dollar phone to look at tik tok videos.
Hey, I resemble that remark (a little)... my phone before last was a Motorola G Power ($149 - no contract), and something happened to the ($8) case so I started using it without the case and after a month I dropped it and cracked the screen. Would have just gotten another but they weren't conveniently available somehow, so I "cheaped out" and got the Moto G Play ($99 NC) and... damn, I've never had a slow phone before, but this one is just a dog - IMO completely without reason looking at the hardware specs, but something about it just really brings on the lag. Then it also has no Macro camera, which actually makes a difference to me a few dozen times a year... so, after 6-8 months of suffering with this dog, I found unlocked 2021 Moto G Powers on sale for: $149... and so, I'm actually replacing my phone before the last one died for the first time ever in my life. Now, what can I do with the G Play after it's not a daily driver....? I wonder if the Google Fi data only SIMs support hot spot... probably not.
Anyway, everybody has their priorities, and those thousand dollar tik tok scrollers gotta flash the fruit to keep up with their friends - the majority of them wouldn't want to know what ECC is or what it does, it would make them look smart in bad ways for their social standing. Would they pay an extra $50 to get the phone with ECC memory? Probably not. Marketing wonks have decided that they probably won't pay an extra $5 - either that, or they figure that by the time normal RAM goes flaky they'll be selling them a replacement phone, so: win win, lower costs up front AND faster replacement cycle.
Yes, I wish that ECC was at least an option, if not standard, on everything that uses RAM. Not holding my breath, though. We'll be lucky if USB-C actually gets standardized as the charging cord for the majority of our gadgets, before USB-D comes along.
(Score: 0) by Anonymous Coward on Wednesday October 12 2022, @08:37PM (1 child)
From Wikipedia:
- 1 bit error per gigabyte of RAM per 1.8 hours
- 8% of DIMM memory modules were affected by errors per year
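To get a feel for what those two figures would mean on a given machine, here is a small Python sketch. It simply scales the quoted rates linearly and treats DIMMs as independent, which are simplifying assumptions; the function names are made up for illustration.

```python
def expected_bit_errors(ram_gb: float, hours: float,
                        hours_per_error_per_gb: float = 1.8) -> float:
    """Expected bit errors, scaling '1 error per GB per 1.8 hours' linearly."""
    return ram_gb * hours / hours_per_error_per_gb


def prob_any_dimm_affected(dimm_count: int,
                           per_dimm_yearly_rate: float = 0.08) -> float:
    """Chance that at least one DIMM sees errors in a year.

    Assumes each DIMM is affected independently with the same probability.
    """
    return 1.0 - (1.0 - per_dimm_yearly_rate) ** dimm_count


if __name__ == "__main__":
    # Example inputs (16 GB, one day, four DIMMs) are arbitrary choices.
    print("expected bit errors, 16 GB over 24 h:", expected_bit_errors(16, 24))
    print("P(at least one of 4 DIMMs affected per year):",
          round(prob_any_dimm_affected(4), 4))
```

The first figure is a worst-case style rate, so real machines will usually see far fewer flips; the sketch only shows how quickly the quoted number adds up if taken at face value.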
(Score: 5, Funny) by inertnet on Wednesday October 12 2022, @08:57PM
Tht's not too bad, mOst of my tex will be perectly fine then.
(Score: 2) by higuita on Thursday October 13 2022, @01:14AM (3 children)
ECC is important for everybody; after all, nobody (not even gamers or your parents) likes crashes or, even worse, corrupted files... AMD has supported it for a long time, though not all MBs implemented it: because intel refused to support ECC in desktops and laptops, the few dollars extra for ECC support in the MB, plus the ECC RAM, were always a problem when competing with intel.
Also, the older slower RAM speed and module size offset some error chance... and DOS and windows were always crashing anyway...
But time went by: RAM speed is very high today, RAM cells are much smaller, windows got better and, especially, linux proved that an OS can be stable, even on the desktop. AMD is still supporting ECC, and the new DDR5 RAM supports on-die ECC, which can detect and fix some memory errors inside the memory chip, but not in transit (as normal ECC does)... probably because it is really needed, but since intel doesn't support full ECC, they baked in a half solution until intel changes its mind.
Yes, ECC makes everything that uses it a little more expensive, but the price increase is small and it allows things like this: linus detected that his computer's memory went bad, probably from many ECC errors as it tried to recover the corruption... without it, everything would start to slowly corrupt... possibly even the backups.
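For anyone wondering how "detect and fix" can work at all, here is a toy Hamming(7,4) code in Python: 4 data bits get 3 parity bits, and any single flipped bit can be located and corrected. This is only an illustration of the principle; real ECC DIMMs use wider codes (typically 8 check bits per 64 data bits) implemented in hardware, and the function names here are made up.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome); syndrome is the 1-based position of a
    corrected single-bit error, or 0 if no error was detected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                      # simulate a single bit flip in "memory"
    data, where = hamming74_decode(word)
    print("recovered data:", data, "corrected position:", where)
```

A code like this silently repairs one bad bit per word; a failing DIMM produces so many corrections (and eventually uncorrectable errors) that the hardware can report it instead of letting data rot quietly.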
(Score: 2) by Rich on Thursday October 13 2022, @08:30AM (2 children)
What baffles me is that Linux doesn't have an out-of-the-box software ECC. I've read numbers in the order of a bit per GB per hour lost. Even if such a software solution could not recover from all errors, it would give valuable statistics just how bad it is and whether a certain module is particularly affected on a given system.
There's a paper (master's thesis, the guy should be worthy of a PhD for doing that with a bit more of filler text...):
https://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
Money quote: "Preliminary measurements with an implementation of SoftECC in the JOS kernel on the x86 architecture show that SoftECC can halve the number of undetectable soft errors using minimal compute time."
(Score: 2) by higuita on Sunday October 16 2022, @05:57PM (1 child)
What are you saying? linux has supported ECC for many years!
The Linux kernel has edac for managing ECC, and it was added in 2007
That document is from 2005, so that document is WAY obsolete. Also, they talk about soft ECC, a fake, cpu based memory checksum, Linux EDAC is using hardware ECC
Not only does software ECC use a lot of cpu, it can only catch some errors; hardware ECC can catch much more. So software ECC was never implemented. If someone needs ECC, they should buy hardware that supports it: either a server-based intel, or an AMD desktop/workstation/server, as intel refuses to support ECC in anything other than servers and almost all AMD cpus do support ECC. Do not forget that you also need a MB and ram with ECC, so it is not just the cpu
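On a machine where an EDAC driver is loaded, the corrected and uncorrected error counters show up in sysfs under /sys/devices/system/edac/mc/. A minimal Python sketch for reading them follows; the function name is made up, and on systems without ECC hardware or without an EDAC driver the directory is simply missing or empty, in which case this returns nothing.

```python
from pathlib import Path

# Standard sysfs location of the EDAC memory-controller entries on Linux.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_MC_ROOT) -> dict:
    """Return {controller: {"ce_count": n, "ue_count": n}} for each mcN entry.

    ce_count is the number of corrected errors, ue_count the number of
    uncorrected errors, as reported by the kernel's EDAC driver.
    """
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

A steadily climbing ce_count on one controller is the kind of early warning that non-ECC systems never give you.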
(Score: 2) by higuita on Monday October 17 2022, @12:36PM
Actually, the first official merge into the kernel was in 2006; before that it lived as an external patch for some years