
How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder

Accepted submission by martyb at 2016-03-01 22:43:05
Hardware

A recent article in IEEE Spectrum looks into the challenges of building an exascale computer [ieee.org].

Al Geist [ornl.gov] works at Oak Ridge National Laboratory and worries about... monsters hiding in the steel cabinets of supercomputers that threaten to crash the largest computing machines on the planet.

[...] the second fastest supercomputer in the world in 2002, a machine called ASCI Q at Los Alamos National Laboratory. When it was first installed at the New Mexico lab, this computer couldn’t run more than an hour or so without crashing. ...The problem was that an address bus on the microprocessors found in those servers was unprotected, meaning that there was no check to make sure the information carried on these within-chip signal lines did not become corrupted. And that’s exactly what was happening when these chips were struck by cosmic radiation, the constant shower of particles that bombard Earth’s atmosphere from outer space.

In the summer of 2003, Virginia Tech researchers built a large supercomputer out of 1,100 Apple Power Mac G5 computers. They called it Big Mac. To their dismay, they found that the failure rate was so high it was nearly impossible even to boot the whole system before it would crash.

The problem was that the Power Mac G5 did not have error-correcting code (ECC) memory, and cosmic ray–induced particles were changing so many values in memory that out of the 1,100 Mac G5 computers, one was always crashing.
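The ECC protection the G5 lacked works by storing redundant check bits alongside the data, so a single flipped bit can be located and repaired. As a rough illustration (not the specific code used in any of these machines), here is the classic Hamming(7,4) scheme, which expands 4 data bits to 7 so that any one-bit flip can be corrected; real DRAM ECC uses wider SEC-DED codes built on the same idea:

```python
def hamming74_encode(d):
    """Encode 4 data bits (list of 0/1) into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4          # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Locate and fix a single flipped bit; return the corrected codeword."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return c

word = hamming74_encode([1, 0, 1, 1])
hit = list(word)
hit[4] ^= 1                           # simulate a cosmic-ray strike on one bit
print(hamming74_correct(hit) == word) # True: the flip was detected and repaired
```

Without the check bits, that same strike would silently corrupt the value, which is exactly what kept crashing Big Mac.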

[...] Just how many spurious bit flips are happening inside supercomputers already? To try to find out, researchers performed a study [selse.org] [PDF] in 2009 and 2010 on the then most powerful supercomputer [top500.org]—a Cray XT5 system at Oak Ridge, in Tennessee, called Jaguar.

Jaguar had 360 terabytes of main memory, all protected by ECC. I and others at the lab set it up to log every time a bit was flipped incorrectly in main memory. When I asked my computing colleagues elsewhere to guess how often Jaguar saw such a bit spontaneously change state, the typical estimate was about a hundred times a day. In fact, Jaguar was logging ECC errors at a rate of 350 per minute.
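Scaling those figures to a common unit shows just how far off the guesses were:

```python
# The colleagues' typical guess was ~100 bit flips per day; Jaguar's ECC
# logs showed 350 corrected errors per minute. Convert to the same unit:
measured_per_minute = 350
per_day = measured_per_minute * 60 * 24
print(per_day)          # 504000 corrected bit errors per day
print(per_day / 100)    # 5040.0 -- roughly 5,000x the typical estimate
```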

[Continues.]

EXTENDED COPY FOLLOWS

Supercomputer operators have had to struggle with many other quirky faults as well. To take one example: The IBM Blue Gene/L system [llnl.gov] at Lawrence Livermore National Laboratory, in California, the largest computer in the world from 2004 to 2008, would frequently crash while running a simulation or produce erroneous results. After weeks of searching, the culprit was uncovered: the solder used to make the boards carrying the processors. Radioactive lead in the solder was found to be causing bad data in the L1 cache, a chunk of very fast memory meant to hold frequently accessed data. The workaround to this resilience problem on the Blue Gene/L computers was to reprogram the system to, in essence, bypass the L1 cache. That worked, but it made the computations slower.

[...] But the software challenges are also daunting. To understand why, you need to know how today’s supercomputer simulations deal with faults. They periodically record the global state of the supercomputer, creating what’s called a checkpoint. If the computer crashes, the simulation can then be restarted from the last valid checkpoint instead of beginning some immense calculation anew.
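The checkpoint/restart pattern described above can be sketched in a few lines. This is a minimal illustration, not code from any of the machines mentioned; the file name, state layout, and step counts are all made up:

```python
import os
import pickle

CHECKPOINT = "sim.ckpt"   # illustrative path for the saved global state

def save_checkpoint(state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint():
    """Resume from the last valid checkpoint, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "value": 0.0}

def run(total_steps=1000, ckpt_every=100):
    state = load_checkpoint()          # after a crash, this skips finished work
    while state["step"] < total_steps:
        state["value"] += 1.0          # stand-in for one simulation step
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            save_checkpoint(state)     # periodically record the global state
    return state

print(run()["step"])                   # 1000
```

If the process dies mid-run, relaunching it resumes from the last multiple of `ckpt_every` rather than from step zero.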

This approach won’t work indefinitely, though, because as computers get bigger, the time needed to create a checkpoint increases. Eventually, this interval will become longer than the typical period before the next fault. A challenge for exascale computing is what to do about this grim reality.
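The crossover the article warns about can be made concrete with Young's classic approximation for the optimal checkpoint interval (a standard result in the HPC literature, not from the article itself): t ≈ √(2·C·M), where C is the time to write one checkpoint and M is the mean time between failures. As machines grow, C rises and M shrinks, and once C approaches M almost no useful work fits between one checkpoint and the next crash:

```python
import math

def optimal_interval(checkpoint_time, mtbf):
    """Young's approximation: compute time between checkpoints, same units in/out."""
    return math.sqrt(2 * checkpoint_time * mtbf)

# Illustrative numbers only: a 30-minute checkpoint on a machine that
# fails once a day, versus one that fails once an hour.
print(round(optimal_interval(30, 24 * 60)))  # 294 minutes of work per checkpoint
print(round(optimal_interval(30, 60)))       # 60 minutes: checkpointing dominates
```

In the second case the machine spends half its time just writing checkpoints, which is the "grim reality" exascale designers face.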

The article covers other examples of past problems, as well as new ones to be dealt with, such as how to power an exascale computer without it requiring its own 300 MW power plant, where "The electric bill to run such a supercomputer would be about a third of a billion dollars per year."

Here's a chance for the graybeards to tell of their experiences with high-performance computing. What problems have you faced? What are your stumbling blocks today? Where do you foresee the biggest challenges in the years to come?


Original Submission