SoylentNews is people

posted by Fnord666 on Wednesday January 31 2018, @05:14PM
from the doesn't-raid-fix-this? dept.

Arthur T Knackerbracket has found the following story:

In 2015, Microsoft senior engineer Dan Luu forecast a bountiful harvest of chip bugs in the years ahead.

"We've seen at least two serious bugs in Intel CPUs in the last quarter, and it's almost certain there are more bugs lurking," he wrote. "There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we've moved past that."

Thanks to growing chip complexity, compounded by hardware virtualization, and reduced design validation efforts, Luu argued, the incidence of hardware problems could be expected to increase.

This month's Meltdown and Spectre security flaws, which affect chip designs from AMD, Arm, and Intel to varying degrees, support that claim. But there are many other examples.


  • (Score: 4, Interesting) by requerdanos on Wednesday January 31 2018, @11:43PM (2 children)


    Intel is taking a lot of heat lately, but all the first-run Ryzen processors from AMD have a bug that causes random segfaults, especially when compiling under linux (a not uncommon occurrence if one likes to linux).

    Here is an actual tech support letter I received from AMD. Some identifying information has been changed or obscured, otherwise it's 100% as I received it.

    Original Text
    From: TECH.SUPPORT@AMD.COM
    To: requerdanos@..............
    CC:
    Subject: RE: Ryzen/Linux segfault at 2f ip 0...

    Dear requerdanos,

    Your service request : SR #{ticketno:[######6680]} has been reviewed and updated.

    Response and Service Request History:

    Thank you for your email and background information about your issue. I’m sorry to hear that you’re experiencing stability issues with your system. Please be assured that I am here to help find a resolution to your problem

    At this time, I would like focus on your system’s hardware configuration. I need to collect some more information about your system which can help with our troubleshooting.

    Please provide the details of the following hardware components in your system:

            Make and model of motherboard

            Motherboard BIOS version

            Make and model of RAM

            Make and model of the power supply unit

    Please could you let me know the current settings you have for the CPU VCORE, SOC, and RAM? It would be very helpful if you could provide with pictures of your BIOS screens with these settings.

    In addition, through troubleshooting with other customers we have found that the layout of the components inside the system case have caused sub-optimal cooling of the CPU causing a variety of issues.

    I would like to better understand your system cooling to rule out any thermal issues. Please could you provide a picture of the whole interior of your system showing the CPU cooler?

    Also, could you let me know the reported CPU temperature during heavy load or when the errors occur?

    In order to update this service request, please respond, leaving the service request reference intact.

    Best regards,

    Asok

    AMD Global Customer Care

    That's right, their answer was basically "pics or it didn't happen." I am working to comply with their request. Also, they sent me this before linux 4.15 was released, wanting to know what temperature was reported--and 4.15 is the first kernel version to feature Ryzen CPU temperature reporting.

  • (Score: 0) by Anonymous Coward on Thursday February 01 2018, @10:56AM (1 child)


    That's right, their answer was basically "pics or it didn't happen."

    The way I read their answer is: "we need accurate information to be able to figure out what the problem is". Which makes perfect sense. Many people are highly inaccurate when giving descriptions of things that went wrong (which is perfectly normal and nothing to blame them for), and in complex systems that can easily mean the problem solver keeps looking in the wrong places and won't come near pinpointing and solving the problem. You need to help the problem solver to help you by being accurate, and this problem solver is helping you to be accurate by asking for pictures. Just work together for the best result.

    • (Score: 2) by requerdanos on Thursday February 01 2018, @05:45PM


      You need to help the problem solver to help you by being accurate, and this problem solver is helping you to be accurate

      There is a known CPU bug [amd.com] which manifests especially when compiling with gcc, and a simple test [github.com] for it, which I sent them the output of.

      Besides which my processor batch is known to have the bug, and I sent them that as well.

      There is not an issue in this particular instance with "well, things are complicated, and we need to be real accurate."

      I have a first-run CPU with a bug. AMD will replace them, but only if you complete their endurance course (instead of replacing them because they are defective and proven so).

      The temperature output they ask for might have been relevant, but did not even exist when they asked for it--no available driver.

      The CPU does not have a bug because of the placement of internal components, because of what other hardware it's paired with, or the phase of the moon.

      It left the factory with that bug, I paid $500 for that buggy CPU, and here it is in my computer.

      There isn't much difference here between "Well, we need to carefully weigh all the factors" and "pics or it didn't happen."

      I am out $500 for a buggy CPU. I want a new one. Give me it. That is all the accuracy needed here.

      Meantime thank God this machine does not run Gentoo.