posted by cmn32480 on Wednesday March 02 2016, @02:12PM   Printer-friendly
from the give-it-a-tin-foil-hat dept.

A recent article in IEEE Spectrum looks into the challenges of building an exascale computer.

Al Geist works at Oak Ridge National Laboratory and worries about... monsters hiding in the steel cabinets of supercomputers, threatening to crash the largest computing machines on the planet.

[...] the second fastest supercomputer in the world in 2002, a machine called ASCI Q at Los Alamos National Laboratory. When it was first installed at the New Mexico lab, this computer couldn't run more than an hour or so without crashing. ...The problem was that an address bus on the microprocessors found in those servers was unprotected, meaning that there was no check to make sure the information carried on these within-chip signal lines did not become corrupted. And that's exactly what was happening when these chips were struck by cosmic radiation, the constant shower of particles that bombard Earth's atmosphere from outer space.

In the summer of 2003, Virginia Tech researchers built a large supercomputer out of 1,100 Apple Power Mac G5 computers. They called it Big Mac. To their dismay, they found that the failure rate was so high it was nearly impossible even to boot the whole system before it would crash.

The problem was that the Power Mac G5 did not have error-correcting code (ECC) memory, and cosmic ray–induced particles were changing so many values in memory that out of the 1,100 Mac G5 computers, one was always crashing.

[...] Just how many spurious bit flips are happening inside supercomputers already? To try to find out, researchers performed a study [PDF] in 2009 and 2010 on the then most powerful supercomputer—a Cray XT5 system at Oak Ridge, in Tennessee, called Jaguar.

Jaguar had 360 terabytes of main memory, all protected by ECC. I and others at the lab set it up to log every time a bit was flipped incorrectly in main memory. When I asked my computing colleagues elsewhere to guess how often Jaguar saw such a bit spontaneously change state, the typical estimate was about a hundred times a day. In fact, Jaguar was logging ECC errors at a rate of 350 per minute.
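To put those two numbers on the same footing, a quick sanity check (just arithmetic, not from the article):

```python
# Compare the colleagues' guess with Jaguar's observed ECC rate,
# using the numbers quoted above.
guessed_per_day = 100          # typical colleague estimate
observed_per_minute = 350      # logged rate on Jaguar

observed_per_day = observed_per_minute * 60 * 24
print(observed_per_day)                    # 504000 events/day
print(observed_per_day / guessed_per_day)  # ~5000x the guess
```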

[Continues.]

Supercomputer operators have had to struggle with many other quirky faults as well. To take one example: The IBM Blue Gene/L system at Lawrence Livermore National Laboratory, in California, the largest computer in the world from 2004 to 2008, would frequently crash while running a simulation or produce erroneous results. After weeks of searching, the culprit was uncovered: the solder used to make the boards carrying the processors. Radioactive lead in the solder was found to be causing bad data in the L1 cache, a chunk of very fast memory meant to hold frequently accessed data. The workaround to this resilience problem on the Blue Gene/L computers was to reprogram the system to, in essence, bypass the L1 cache. That worked, but it made the computations slower.

[...] But the software challenges are also daunting. To understand why, you need to know how today's supercomputer simulations deal with faults. They periodically record the global state of the supercomputer, creating what's called a checkpoint. If the computer crashes, the simulation can then be restarted from the last valid checkpoint instead of beginning some immense calculation anew.

This approach won't work indefinitely, though, because as computers get bigger, the time needed to create a checkpoint increases. Eventually, this interval will become longer than the typical period before the next fault. A challenge for exascale computing is what to do about this grim reality.
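The trade-off described here is commonly modeled with Young's first-order approximation, which puts the optimal checkpoint interval near the square root of twice the checkpoint cost times the mean time between failures; once a checkpoint takes on the order of the MTBF itself, the machine spends most of its time checkpointing. A rough sketch of that model (the formula is the standard approximation, but the specific numbers below are illustrative assumptions, not from the article):

```python
import math

def optimal_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation for the optimal
    time between checkpoints."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

def useful_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """Rough fraction of wall-clock time doing real work:
    lose the checkpoint itself once per interval, plus (on
    average) half an interval of rework per failure."""
    overhead = checkpoint_cost_s / interval_s
    rework = (interval_s / 2) / mtbf_s
    return max(0.0, 1.0 - overhead - rework)

# Illustrative numbers: a 10-minute checkpoint, a 1-day MTBF.
tau = optimal_interval(600, 86_400)
print(round(tau))  # ~10182 s, i.e. checkpoint every ~2.8 hours
print(round(useful_fraction(tau, 600, 86_400), 2))  # ~0.88
```

As the MTBF shrinks toward the checkpoint cost, that useful fraction collapses toward zero, which is exactly the "grim reality" exascale designers face.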

The article covers other examples of past problems, as well as new ones still to be dealt with, such as how to power an exascale computer without giving it its own 300 MW power plant: "The electric bill to run such a supercomputer would be about a third of a billion dollars per year."

Here's a chance for the graybeards to tell of their experiences with high-performance computing. What problems have you faced? What are your stumbling blocks today? Where do you foresee the biggest challenges in the years to come?


Original Submission

  • (Score: 1, Interesting) by Anonymous Coward on Wednesday March 02 2016, @03:19PM

    by Anonymous Coward on Wednesday March 02 2016, @03:19PM (#312652)

    It seems that there are inherent upper limits to computing power.

I predict that, first, someone will write a PhD thesis on that, basically creating some sort of new Shannon formula; second, the whole computing paradigm will change profoundly before we can make viable exa+ scale computers; and third, these limits will affect quantum computers too (and will affect them even more than they affect binary computers of similar size).

    On the plus side, strong cryptography will probably stay ahead of computer assisted cryptanalysis.

    • (Score: 3, Interesting) by takyon on Wednesday March 02 2016, @03:50PM

      by takyon (881) <{takyon} {at} {soylentnews.org}> on Wednesday March 02 2016, @03:50PM (#312664) Journal

      All it takes is for something like this to work and then all bets are off:

      http://www.hpcwire.com/2014/08/06/exascale-breakthrough-weve-waiting/ [hpcwire.com]
      http://www.nextplatform.com/2015/03/25/a-light-approach-to-genomics-with-optical-processors/ [nextplatform.com]

      Desk-exascale. It's described as a coprocessor in later articles, but many supercomputers use tons of those anyway.

      I will do an article on it when I'm sure it's not bunk and they deliver on one of their big ticket promises (for example, a multi-"petaflop" machine using a couple kilowatts of power).

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
    • (Score: 0) by Anonymous Coward on Wednesday March 02 2016, @10:32PM

      by Anonymous Coward on Wednesday March 02 2016, @10:32PM (#312839)

      Finding a limit and calling it "an inherent upper limit" is one of the most oft-repeated fallacies of science.

      Are you sure you know what you're talking about and aren't just extrapolating from some 5-10-year-old-headlines with your incomplete picture of how computation works?

    • (Score: 1) by khallow on Thursday March 03 2016, @02:25AM

      by khallow (3766) Subscriber Badge on Thursday March 03 2016, @02:25AM (#312920) Journal
      Errors are not that significant. You can stuff in a little error correcting code for a modest computation hit and greatly lower the error rate. The problems they described above were for a system with some serious architectural issues.

      Also, one doesn't need to record the global state of a supercomputer for this sort of local error. For example, one approach would be to have three processors at a time do identical computations and swap out the processor and/or memory that breaks.
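The three-processors idea is classic triple modular redundancy: run the computation in triplicate and take a majority vote, so a single corrupted result is outvoted. A minimal sketch of the voting step (illustrative only, not any particular supercomputer's implementation):

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Majority vote over three redundant results.
    One bad replica is outvoted; if all three disagree,
    flag the fault rather than guess."""
    value, votes = Counter([a, b, c]).most_common(1)[0]
    if votes >= 2:
        return value
    raise RuntimeError("all three replicas disagree")

# One replica returns a bit-flipped result; the vote masks it.
print(tmr_vote(42, 42, 46))  # -> 42
```

In practice the voter itself and the swap-out of the failed unit are the hard parts, but the local nature of the check is the point: no global snapshot is needed.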
  • (Score: 1, Insightful) by Anonymous Coward on Wednesday March 02 2016, @03:52PM

    by Anonymous Coward on Wednesday March 02 2016, @03:52PM (#312666)

You still have to buy server boards to get ECC memory, even though 16 GB+ is common and file systems exist that require memory integrity: http://louwrentius.com/please-use-zfs-with-ecc-memory.html [louwrentius.com]. How long until the average desktop can't last a day without a kernel panic, and file system corruption is common?

    • (Score: 5, Insightful) by Anonymous Coward on Wednesday March 02 2016, @05:31PM

      by Anonymous Coward on Wednesday March 02 2016, @05:31PM (#312703)

      Found the Intel fan-boy. Look at boards for AMD processors. No shortage of ECC there.

      • (Score: 3, Informative) by rleigh on Wednesday March 02 2016, @11:15PM

        by rleigh (4887) on Wednesday March 02 2016, @11:15PM (#312860) Homepage

        My previous system was an Intel Core2 quad. My current system is an AMD FX 8-core. The main reason for that was support for ECC memory by the mainboard without having to spend a fortune on a Xeon or whatever. My NAS running ZFS on FreeBSD is also AMD (HP Gen7 micro server). Again due to the ECC RAM.

It's not that Intel don't do ECC. But there's typically a significant premium to pay for that one feature, and I'm not at all unhappy with using AMD for development work, gaming etc. The one system can rebuild the entirety of Debian from source (all >20k source packages) in ~36h, which isn't too shabby. You *will* hit memory errors building at that scale and using every last bit of RAM for compiling--you see the corrected ECC errors periodically in the logs--so it's not really an optional extra on today's systems. Given current memory capacity, small feature sizes and the probabilities involved, errors are certainties.

        • (Score: 2) by Techwolf on Thursday March 03 2016, @04:05AM

          by Techwolf (87) on Thursday March 03 2016, @04:05AM (#312950)

          What motherboards are you using?

          • (Score: 2) by rleigh on Sunday March 13 2016, @12:45PM

            by rleigh (4887) on Sunday March 13 2016, @12:45PM (#317607) Homepage

            My main PC is an ASUS Sabertooth R2.0 mainboard with an 8 core AMD FX processor. My NAS is an HP N40L microserver--custom HP board.

  • (Score: 5, Interesting) by c0lo on Wednesday March 02 2016, @04:12PM

    by c0lo (156) Subscriber Badge on Wednesday March 02 2016, @04:12PM (#312679) Journal

    Google [toronto.edu] (PDF warning)

    ...The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days...

    For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit** and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode.
    We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account.
    Finally, unlike commonly feared, we don’t observe any indication that newer generations of DIMMs have worse error behavior.

    ** that's approx 2 to 5 single bit errors in 8 Gigabytes of RAM per hour!!
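That footnote's conversion checks out; translating the study's per-Mbit, per-billion-device-hour rate to an 8 GB machine:

```python
# Convert 25,000-70,000 errors per billion device-hours per Mbit
# (the study's figures) into errors per hour for 8 GB of DRAM.
mbits = 8 * 1024 * 8  # 8 GB expressed in Mbit (65536)
for rate in (25_000, 70_000):
    per_hour = rate / 1e9 * mbits
    print(round(per_hour, 1))  # ~1.6 and ~4.6 errors/hour
```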

NASA's JPL reports [nasa.gov] (PDF warning), in the context of Cassini–Huygens (two redundant, identical flight recorders, each with 2.5 gigabits of DRAM), that during the first 2 1/2 years the spacecraft reported (by telemetry) a nearly constant single-bit error rate of about 280 errors per day.
In November 1997, this rate quadrupled due to a small solar proton event.

    --
    https://www.youtube.com/@ProfSteveKeen https://soylentnews.org/~MichaelDavidCrawford
  • (Score: 0) by Anonymous Coward on Wednesday March 02 2016, @05:35PM

    by Anonymous Coward on Wednesday March 02 2016, @05:35PM (#312704)

    It's my understanding that most desktop computers don't have error-correcting memory anymore (to cut costs). So, what's the rate that a cosmic ray hit will crash or corrupt a running program on a desktop?

    • (Score: 5, Interesting) by Arik on Wednesday March 02 2016, @05:46PM

      by Arik (4543) on Wednesday March 02 2016, @05:46PM (#312708) Journal
      "It's my understanding that most desktop computers don't have error-correcting memory anymore (to cut costs). "

      This is true and I have ranted about it and been ignored for years.

      Uncorrected DRAM should be recognized as defective by design and so should any computer using it. It should simply not be produced. If all production was shifted to ECC memory the per-unit increase in cost would be a few pennies if that.

      "So, what's the rate that a cosmic ray hit will crash or corrupt a running program on a desktop?"

      The article gives us an easy way to approximate it.

      "Jaguar had 360 terabytes of main memory, all protected by ECC. I and others at the lab set it up to log every time a bit was flipped incorrectly in main memory. When I asked my computing colleagues elsewhere to guess how often Jaguar saw such a bit spontaneously change state, the typical estimate was about a hundred times a day. In fact, Jaguar was logging ECC errors at a rate of 350 per minute."

      So 360 terabytes of what we must presume to be relatively high quality DRAM errors 350 times per minute. Rounded off a bit that's once per minute per terabyte. Figuring you have 12 gigabytes on your home system, that works out to roughly once every hour-and-a-half.
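The scaling in that estimate is a simple linear rate conversion (a sketch of the arithmetic, using the article's numbers and the assumed 12 GB desktop):

```python
# Rescale Jaguar's observed ECC error rate to a 12 GB desktop.
jaguar_tb = 360
errors_per_minute = 350

per_tb_per_min = errors_per_minute / jaguar_tb  # ~0.97/min/TB
desktop_tb = 12 / 1024
desktop_per_min = per_tb_per_min * desktop_tb
minutes_per_error = 1 / desktop_per_min
print(round(minutes_per_error))  # ~88 minutes, i.e. about 1.5 h
```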

      The more dense memory becomes the more subject it is to errors. We had ECC on all serious systems back in the days when a MEGABYTE was a lot of RAM. To put anything else in a system today is just astonishingly inappropriate.
      --
      If laughter is the best medicine, who are the best doctors?
      • (Score: 0) by Anonymous Coward on Wednesday March 02 2016, @06:40PM

        by Anonymous Coward on Wednesday March 02 2016, @06:40PM (#312729)

        So 360 terabytes of what we must presume to be relatively high quality DRAM errors 350 times per minute. Rounded off a bit that's once per minute per terabyte. Figuring you have 12 gigabytes on your home system, that works out to roughly once every hour-and-a-half.

        It's late at night for me, so I'm too lazy to work it out- does the math really work that way? Can you divide the probability of coin tosses or dice rolls in that fashion?

        • (Score: 2) by rleigh on Wednesday March 02 2016, @11:21PM

          by rleigh (4887) on Wednesday March 02 2016, @11:21PM (#312865) Homepage

Of course. The probability of an error (bit errors per minute per terabyte) is unchanged. You're just rescaling the error rate to a smaller memory size, so it's basic algebra.

        • (Score: 2) by Arik on Thursday March 03 2016, @01:32AM

          by Arik (4543) on Thursday March 03 2016, @01:32AM (#312909) Journal
          The only obvious error I see reading back is that the question actually asked the chance of it corrupting something important, to paraphrase. I answered with the chances of a bit flip occurring. Obviously the answer to the question he asked is some fraction of the answer I gave. I'd be hard pressed to even give you a ballpark on it though. I'm not aware of any recent research and I suspect it would be a very difficult question to answer quantitatively and accurately.

          I'd say it's clearly far too likely when we've had proper technology for many many decades to prevent it, and the costs involved are so miniscule. This is one of those things that can only happen because computing is this very odd market where most of the buyers have insufficient understanding of what they are buying to make intelligent decisions.
          --
          If laughter is the best medicine, who are the best doctors?
    • (Score: 4, Funny) by bob_super on Wednesday March 02 2016, @07:21PM

      by bob_super (1357) on Wednesday March 02 2016, @07:21PM (#312744)

      Why did you think true geeks live in the basement, and away from the extra-terrestrial radiation?

      • (Score: 0) by Anonymous Coward on Thursday March 03 2016, @11:30AM

        by Anonymous Coward on Thursday March 03 2016, @11:30AM (#313035)

        Ah, but there are other [wikipedia.org] sources of radiation in the basement to fill that void.

  • (Score: 1) by zugedneb on Wednesday March 02 2016, @06:31PM

    by zugedneb (4556) on Wednesday March 02 2016, @06:31PM (#312724)

I've used an AMD Phenom II-based computer for folding since 2012, and when there's no time to dual-boot for gaming, it gets several weeks of uptime.

    Never seen an ECC event on it, although I check the count every time before reboot...
    (4 sticks of Kingston ECC ddr3, 16 GB)

    --
    old saying: "a troll is a window into the soul of humanity" + also: https://en.wikipedia.org/wiki/Operation_Ajax
    • (Score: 2) by tibman on Wednesday March 02 2016, @07:33PM

      by tibman (134) Subscriber Badge on Wednesday March 02 2016, @07:33PM (#312752)

      Same here! Using an AMD Phenom 2 965 along with an AMD HD 6970 gpu. Also fold with my gaming rig sometimes: AMD R9 270 gpu and AMD A10 apu.

      If you aren't currently in a group, maybe you could join SN: http://fah-web2.stanford.edu/cgi-bin/main.py?qtype=teampage&teamnum=230319 [stanford.edu]

      --
      SN won't survive on lurkers alone. Write comments.
    • (Score: 2) by forkazoo on Thursday March 03 2016, @01:20AM

      by forkazoo (2561) on Thursday March 03 2016, @01:20AM (#312904)

      If you've never seen an ECC event, how sure are you that it's being logged accurately? Wouldn't you need to see at least one to know the counter is working?

      • (Score: 1) by zugedneb on Thursday March 03 2016, @03:18AM

        by zugedneb (4556) on Thursday March 03 2016, @03:18AM (#312933)

        good point...
        well, I have not seen the effect of any potential error either.
        no hanging until I reboot, no calculation error in work units, no wrong pixel on screen...

        there must be a reason to suspect, apart from 0...

        --
        old saying: "a troll is a window into the soul of humanity" + also: https://en.wikipedia.org/wiki/Operation_Ajax
  • (Score: 1, Insightful) by Anonymous Coward on Wednesday March 02 2016, @07:29PM

    by Anonymous Coward on Wednesday March 02 2016, @07:29PM (#312748)

    If they hired more greybeards maybe they wouldn't be repeating old mistakes.

    1) Do they test their memory at boot time?
    2) Do they test-format their hard drives before installation? You know ... swap?

    I find people skip both of these steps frequently. As if whatever testing MIGHT have occurred, didn't happen six months in the past and eight thousand miles distant.

    You get what you pay for. Time is money. Downtime is money, too.

    Your choice, folks.

    If you don't have time to do it right the first time, when WILL you have time to RE-do it?

    Wish I had a lawn to order you off of, but I don't even have an unemployment check - never mind a house, or a car, or a lawn.

    ~childo

  • (Score: 3, Informative) by fnj on Thursday March 03 2016, @04:24AM

    by fnj (1654) on Thursday March 03 2016, @04:24AM (#312956)

    When I saw "360 TB of RAM", my jaw dropped. What could anybody possibly need that much RAM for? Actually, the specs [top500.org] say the Jaguar has 584 TB.

    Then it struck me. With 298,592 cores (no shit, that's the honest count), that is only 2 GB per core, or perhaps more pertinently, 32 GB per set of 16 cores. That's how many cores each Opteron 6274 [cpu-world.com] has: 16.

    FWIW, the power dissipation of the entire monstrosity divided by the total core count is only 17 watts per core.

  • (Score: 1) by butthurt on Thursday March 03 2016, @08:03AM

    by butthurt (6141) on Thursday March 03 2016, @08:03AM (#313001) Journal

    The summary mentions steel cabinets. If they're made of recycled steel, they may [nrc.gov] contain [taipeitimes.com] cobalt-60 [nytimes.com].

    The fortune served up is fitting for a story that mentions Macintosh G5 computers:

    The memory management on the PowerPC can be used to frighten small children.
    -- Linus Torvalds