Linux 6.1 Will Make It A Bit Easier To Help Spot Faulty CPUs:
While mostly of benefit to server administrators with large fleets of hardware, Linux 6.1 aims to make it easier to spot problematic CPUs/cores by reporting the likely socket and core when a segmentation fault occurs. That makes it easier to notice a trend when the same CPU/core routinely turns out to be causing problems.
Queued up now in TIP's x86/cpu branch ahead of the Linux 6.1 merge window opening in October is a patch to print the likely CPU at segmentation fault time. Printing the likely CPU core and socket when a seg fault occurs can be helpful if seg faults are routinely landing on the same CPU package or the same particular core.
In a large enough fleet of computers, it is common to have a few bad CPUs. Those can often be identified by seeing that some commonly run kernel code, which runs fine everywhere else, keeps crashing on the same CPU core on one particular bad system.
However, the failure modes in CPUs that have gone bad over the years are often oddly specific, and the only bad behavior seen might be segfaults in programs like bash, python, or various system daemons that run fine everywhere else.
Add a printk() to show_signal_msg() to print the CPU, core, and socket at segfault time.
This is not perfect, since the task might get rescheduled on another CPU between when the fault hit, and when the message is printed, but in practice this has been good enough to help people identify several bad CPU cores.
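For context, here is a minimal, simplified sketch of what that change amounts to. This is an illustration rather than the verbatim upstream patch, and the exact message format and surrounding checks may differ; it assumes the x86 fault path's show_signal_msg() in arch/x86/mm/fault.c and the kernel's existing topology helpers.

/*
 * Simplified sketch of the approach described above, not the exact
 * upstream patch: append the likely CPU, core and socket to the
 * existing segfault message printed by show_signal_msg().
 */
#include <linux/printk.h>
#include <linux/ptrace.h>
#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/topology.h>

static void show_signal_msg(struct pt_regs *regs, unsigned long error_code,
			    unsigned long address, struct task_struct *tsk)
{
	const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
	/*
	 * Racy snapshot: the task may have been rescheduled to another
	 * CPU between the fault and this message, as the commit notes.
	 */
	int cpu = raw_smp_processor_id();

	printk("%s%s[%d]: segfault at %lx ip %px sp %px error %lx",
	       loglvl, tsk->comm, task_pid_nr(tsk), address,
	       (void *)regs->ip, (void *)regs->sp, error_code);

	/* The new bit: report the likely CPU, core and socket. */
	printk(KERN_CONT " likely on CPU %d (core %d, socket %d)", cpu,
	       topology_core_id(cpu), topology_physical_package_id(cpu));

	printk(KERN_CONT "\n");
}

With something along these lines in place, the usual segfault line in the kernel log simply gains a trailing "likely on CPU ..." hint that administrators can tally across a fleet.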
This little helper for spotting potentially faulty processors will be available starting with Linux 6.1 later this year.
(Score: 3, Insightful) by coolgopher on Tuesday August 30 2022, @01:11AM
I'm surprised this hasn't been present for ages. Basic troubleshooting methodology is to attempt to pinpoint the failing location, even if it's by statistical means.
(Score: 4, Insightful) by shrewdsheep on Tuesday August 30 2022, @07:19AM (1 child)
It is of course nice to read about the many new shiny features of new Linux kernels. OTOH I like the idea that Linux still runs on quite old hardware. With kernel images reaching several MB this will become increasingly difficult. I recently installed Linux on a 586-class machine. The boot failed, and it turned out that a so-called unstable "time stamp counter" (TSC) was the culprit. I have no idea what this is; presumably it matches up timing across different cores (which the machine didn't have), but the code was there and running. I hope the kernel keeps being modular so it can be stripped down to fit on old hardware.
Does anybody know whether the minimal kernel size is recorded somewhere?
(Score: 5, Interesting) by Rich on Tuesday August 30 2022, @11:36AM
Indeed. The Kernel has been hijacked by the big money data center crowd and other sinister actors. Not a surprise given they offer industry rates, while the hobbyists get insults from internet retards as compensation. But still, I wonder if some sort of fresh start or fork for general purpose computing would be in order.
I don't see where tweaking I/O scheduling, uploading Turing-complete firewall filters, or full boot-chain certificate management to lock out root from /proc/mem might help in everyday use, yet they add a lot of complexity. One can disable much of this, but the kernel will still be a megabyte-sized executable, while it should be measurable in kilobytes, like embedded RTOS kernels still are today. Likely an unattainable dream, though, once NTFS write support, ACPI bug mitigation for working laptop sleep, and all that are built in.
Oh. And I want cold power-on boot times under a second. Nothing modern hardware could not do.