(Score: 3, Informative) by turgid on Tuesday April 22, @07:03PM (13 children)
A further problem is that all that compiling puts a full load on my computer for several hours, and sometimes drives it to overheating and locking up.
Is this a problem with the heatsink/fan assembly? What CPU is it? Nowadays they tend to be pretty good at throttling the clock to prevent overheating. Maybe it's a power supply problem?
I refuse to engage in a battle of wits with an unarmed opponent [wikipedia.org].
(Score: 1, Informative) by Anonymous Coward on Tuesday April 22, @10:00PM (12 children)
I'm with you on this. There is a good chance they don't have thermal throttling set up. Modern CPUs will self-regulate by backing their boost down to 100% of the rated speed, but not below that unless they have a profile that allows them to underclock under load. That profile isn't the default, so the kernel needs to tell the hardware that it is OK, and most distros and kernels I know of aren't configured to do that. It could also be a power issue, because compiling is known to be extra hard on CPUs and to require more power than the baseline load at the same utilization. That is the reason why having one machine that you only use for compiling and another for working (or having one computer that is basically disposable) is the standard for people working on big projects.
(Score: 3, Interesting) by bzipitidoo on Tuesday April 22, @11:45PM (5 children)
The computer is a fanless desktop from silentpc.com. Was very expensive, but I wanted the silence. Ryzen 5600G CPU.
Seems to have no problem with 2 hours of sustained compiling. Even 4 hours is often okay. Longer than that eventually brings it to a boil, so to speak.
If I have all 6 cores doing compiling, and I fire up some game that engages the integrated 3D accelerated graphics (I don't have a dedicated graphics card, owing to them being extremely expensive at the time I got the PC, during the pandemic), then I can overheat it in perhaps 30 minutes.
(Score: 1, Insightful) by Anonymous Coward on Wednesday April 23, @04:25AM (4 children)
Do you use a profile that allows thermal throttling below the rated speed under load? Which CPU governor do you use? If it is cooking itself after 4 or so hours, then it probably isn't the power supply (unless that is undervolting due to overheating) but the thermal design. The problem for you is that each time the processor overheats to the point that it exceeds the true maximum junction temperature, that maximum falls by an unpredictable amount for a given voltage. So I'd check which governors and thermal controls you are using to help mitigate that, if it is a problem for you. The kernel can be told to do all sorts of things, including automatic underclocking and idle looping, to keep temperatures within user constraints. But you have to tell it that you want it to do that first.
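For anyone wanting to answer the "which governor do you use?" question without installing anything, here is a minimal sketch. It assumes the usual Linux cpufreq sysfs layout (/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor); the function name read_governors is made up for illustration, and a system without a cpufreq driver loaded will simply return nothing.

```python
from pathlib import Path


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq.

    Assumes the standard cpufreq sysfs layout; CPUs without a
    cpufreq directory are skipped.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


if __name__ == "__main__":
    # Reading these files normally does not require root.
    for cpu, governor in read_governors().items():
        print(cpu, governor)
```

If every CPU reports the same governor, that is the one in effect; an empty result suggests no cpufreq driver is active, which would be worth knowing before tuning anything else.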
(Score: 2) by bzipitidoo on Thursday April 24, @03:25AM (3 children)
I confess I have never looked into this. I have no idea if a CPU governor is being used. But it sure sounds like a good idea. However, a bit of searching for info on this matter brought up a lot of docs to read. Was hoping for a simple, quick solution, along the lines of "echo something > /dev/something"
(Score: 1, Informative) by Anonymous Coward on Thursday April 24, @05:32AM (2 children)
The easiest way to potentially solve it is to issue the command (cpufreq-set -g powersave) or (cpupower frequency-set -g powersave), which will cause the CPU to use only the minimum allowed speed regardless of load. Otherwise, you can use that tool to experiment until you find a CPU speed that will not overheat. It will slow everything down, but in exchange it almost ensures the machine cannot overheat, at least until you next reboot. There are also a number of daemons you can use to control it based on your platform and requirements. Sadly, there isn't an easy answer, because what works for one system doesn't work for another. And part of the problem is that, since it appears that you have exceeded the maximum temperature before, the overheat protection may not be aggressive enough, because the temperature at which the processor will fail is now lower.
And as a frank side note: you'd think a fanless PC manufacturer would have better documentation on how to configure their servers in this manner.
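The "echo something > /something" version of the above can be sketched as follows. It assumes the standard cpufreq sysfs files (scaling_governor, scaling_available_governors, scaling_max_freq, the last one in kHz); the function names set_governor and cap_max_freq_khz are hypothetical helpers, not part of cpupower or any other tool. Writing to the real files needs root, and the settings do not survive a reboot.

```python
from pathlib import Path

CPU_ROOT = "/sys/devices/system/cpu"  # standard location, assumed here


def set_governor(governor, sysfs_root=CPU_ROOT):
    """Write `governor` to every CPU's scaling_governor file.

    Returns the list of CPU names that were changed. Raises ValueError
    if a CPU advertises its governors and `governor` is not among them.
    """
    changed = []
    for cpufreq_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*/cpufreq")):
        available = cpufreq_dir / "scaling_available_governors"
        if available.exists() and governor not in available.read_text().split():
            raise ValueError(f"{governor!r} not offered by {cpufreq_dir}")
        (cpufreq_dir / "scaling_governor").write_text(governor)
        changed.append(cpufreq_dir.parent.name)
    return changed


def cap_max_freq_khz(max_khz, sysfs_root=CPU_ROOT):
    """Cap every CPU's scaling_max_freq at `max_khz` (value in kHz)."""
    capped = []
    for cpufreq_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*/cpufreq")):
        (cpufreq_dir / "scaling_max_freq").write_text(str(max_khz))
        capped.append(cpufreq_dir.parent.name)
    return capped

# Example (as root): set_governor("powersave"); cap_max_freq_khz(2_400_000)
# The 2.4 GHz figure is an arbitrary illustration, not a recommendation.
```

Capping the maximum frequency is the "experiment to find a speed that will not overheat" route: lower the cap, run a long compile, and watch the temperatures.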
(Score: 0) by Anonymous Coward on Thursday April 24, @04:36PM (1 child)
Setting CPU governor to powersave is easy but it might not work. During a really long compile the heat will build up and a fanless system can't clear it out. You can't cool a CPU with hot air.
Lowering the CPU thermal throttle temperature will probably help more, which you can do with the ryzenadj tool.
Realistically though, a fanless system just isn't a great choice for long sustained workloads. For silence, the best approach is open-loop water cooling with a big radiator and fans that can throttle down to silent speed. Not really viable for a laptop, but it gives you silence 90% of the time and max performance (and still not very loud) the other 10% of the time.
(Score: 0) by Anonymous Coward on Thursday April 24, @10:49PM
It is a tradeoff. Throttling the CPU with ryzenadj versus the governor should affect the same settings under load. The difference is that powersave is simpler, since it doesn't need much tuning and experimentation. Coming up with a complete thermal profile would be best. In the end, the solution will probably include a mix of hardware and kernel tuning. Right now, the APU is cooking itself, which means the throttling limit is already being exceeded. At a minimum, the APU is signaling the platform to shut down (either hard or soft), and that is a sign that the maximum junction temperature is being exceeded and therefore lowered. That means that the built-in cooling profile is unreliable and that the kernel probably needs to get involved by actively cooling through injected idle loops.
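To make the "thermal profile" idea concrete, here is a rough sketch of the decision step of a user-space feedback loop. To be clear, this is not kernel idle injection; it is a cruder stand-in that steps a frequency cap down when hot and back up when cool. The thresholds (85 and 70 degrees C), the 200 MHz step, and the function name next_freq_cap_khz are all assumptions for illustration and would need tuning for a particular machine.

```python
def next_freq_cap_khz(temp_c, current_cap_khz, min_khz, max_khz,
                      hot_c=85.0, cool_c=70.0, step_khz=200_000):
    """Pick the next frequency cap from the current temperature.

    At or above `hot_c` the cap drops by one step (never below min_khz);
    at or below `cool_c` it rises by one step (never above max_khz).
    Between the two thresholds the cap is left alone, which gives the
    loop some hysteresis so it does not flap up and down.
    """
    if temp_c >= hot_c:
        return max(min_khz, current_cap_khz - step_khz)
    if temp_c <= cool_c:
        return min(max_khz, current_cap_khz + step_khz)
    return current_cap_khz

# A daemon would call this every few seconds with a fresh temperature
# reading and write the result to each CPU's scaling_max_freq file.
```

Keeping the decision as a pure function means the policy can be tried against logged temperatures before it is allowed to touch real hardware.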
(Score: 2) by hendrikboom on Friday April 25, @06:44PM (5 children)
So is temperature management another black art? Like Linux audio also seems to be?
How does one do all those things? How does one even find out how to do these things? How does one even find out what can be done?
(Score: 2) by turgid on Friday April 25, @07:25PM (1 child)
Settle down for a long night with the Linux kernel configuration menus?
I refuse to engage in a battle of wits with an unarmed opponent [wikipedia.org].
(Score: 3, Touché) by Gaaark on Thursday May 01, @04:17PM
Geez i remember those days: haven't done a kernel config in ... decade & half?...longer?...
...then the compiling...
The good ol' days, lol.
--- Please remind me if I haven't been civil to you: I'm channeling MDC. I have always been here. ---Gaaark 2.0 --
(Score: 1, Insightful) by Anonymous Coward on Saturday April 26, @01:09AM (2 children)
Is it a bit of a black art? Yes. Sadly, like audio, that is a bit by design, because complexity slowly increased over time without anyone actually doing a clean redesign. It starts with the fact that you have multiple hardware manufacturers doing multiple (and often incompatible) things, even between their own products. Next are the assemblers that put those components together in different combinations with different designs. Then there are OSes that do different things with the same settings. Additionally, you have users that want different things from identical platforms. Finally, most people don't have to actively do anything because it usually just works, but when it doesn't, you need serious options.
So how does one learn these things? I'm not really sure. I've had the benefit of being in the industry as these things cropped up. Adding a new piece to a picture you've already assembled is easy. It also helps that you really only need to do thermal design when you are designing a platform; otherwise you usually have someone else's work to start with. I think the best way to learn is by looking at an OEM install or other professional design. Or you could look at what sort of things a distro like Debian or Fedora does on default hardware. Examine the power management profiles and tables, check their daemon configuration, look at udev rules, and browse the applicable sysfs entries for things like thermal and hwmon. See how they handle it and you can get a picture of what works and how it fits together.
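As a starting point for browsing the hwmon entries mentioned above, here is a minimal sketch that lists every temperature sensor the kernel exposes. It assumes the standard hwmon sysfs layout (/sys/class/hwmon/hwmonN/ with a name file, tempN_input in millidegrees Celsius, and an optional tempN_label); the function name read_hwmon_temps is invented for this example.

```python
from pathlib import Path


def read_hwmon_temps(hwmon_root="/sys/class/hwmon"):
    """Return a list of (chip, label, degrees_c) for all hwmon sensors.

    tempN_input holds millidegrees Celsius; tempN_label is optional,
    so the input file name is used when no label exists.
    """
    readings = []
    for chip_dir in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("temp*_input")):
            label_file = input_file.with_name(
                input_file.name.replace("_input", "_label"))
            if label_file.exists():
                label = label_file.read_text().strip()
            else:
                label = input_file.name
            try:
                millidegrees = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors refuse reads or return junk
            readings.append((chip, label, millidegrees / 1000.0))
    return readings


if __name__ == "__main__":
    for chip, label, temp_c in read_hwmon_temps():
        print(f"{chip:12s} {label:16s} {temp_c:6.1f} C")
```

Running this in a loop during a long compile shows how quickly a fanless case saturates, which is the number any governor or cap setting has to be tuned against.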
(Score: 2) by Gaaark on Thursday May 01, @04:24PM (1 child)
Looked up modem cards one time to see if my card was working: EVERY card manufacturer blinks their lights differently even in their own products.
The card is blinking: one light green the other a steady yellow? I figured it might be receiving but not transmitting... but no: it was fine. Another card? It might mean there was a problem, it might not.
Steady green or yellow? Blinking green or yellow? Some random combination of the two? Not blinking at all?
You have to look up EVERY SINGLE CARD to look at its specs to see what is going on.
F*ck it... it wasn't working so i replaced it. The new one blinked or not in some combination... dunno...but this one worked, so....
SHEEEEESH!
--- Please remind me if I haven't been civil to you: I'm channeling MDC. I have always been here. ---Gaaark 2.0 --
(Score: 0) by Anonymous Coward on Sunday May 11, @01:56AM
We had a switch where a green LED was normal and red was an error. Except for one hardware version. There, red was normal and green was an error. In the next version they switched the colors back. The story told to us by our support rep was that they changed two-color LEDs and no one realized that the new part had the opposite pinout. Rather than eat the cost or get into a huge fight, the OEM just changed their documentation for the bad units and called it a new revision. It was a pain to scan the lights because you had to remember which switch was which revision. Finally, once we figured out how many "red LEDs" we would have and what our redundancy needs were, we started putting them in specific places and marking those cabinets with red painter's tape so the mental load was lower. Ended up saving a ton of money because they had a hard time selling that revision to other customers.
Moral of the story: sometimes double-checking your design can save your company hundreds of thousands of dollars down the road.