
posted by takyon on Tuesday June 23 2020, @04:00PM
from the exascale-arm-wrestling dept.

New #1 Supercomputer: Fujitsu's Fugaku

High performance computing is now at a point where, to be number one, you need very powerful, very efficient hardware, lots of it, and lots of capability to deploy it. Deploying a single rack of servers totaling a couple of thousand cores isn't going to cut it. The former #1 supercomputer, Summit, is built from 22-core IBM Power9 CPUs paired with NVIDIA GV100 accelerators, totaling 2.4 million cores and consuming 10 megawatts of power. The new Fugaku supercomputer, built at RIKEN in partnership with Fujitsu, takes the top spot on the June 2020 TOP500 list, with 7.3 million cores and consuming 28 megawatts of power.

The new Fugaku supercomputer is bigger than Summit in practically every way. It has 3.02x the cores, 2.8x the score on the official LINPACK test, and consumes 2.8x the power. It also marks the first time that an Arm-based system sits at number one on the TOP500 list.

Also at NYT.

Fujitsu Fugaku report by Jack Dongarra (3.3 MB PDF)

The Fujitsu A64FX is a 64-bit Arm CPU with 48 compute cores plus 2-4 assistant cores for the operating system. It uses 32 GiB of on-package High Bandwidth Memory 2 (HBM2). There are no GPUs or discrete accelerators in the Fugaku supercomputer.

Fugaku can reach as high as 537 petaflops of FP64 (boost mode), or 1.07 exaflops of FP32, 2.15 exaflops of FP16, and 4.3 exaOPS of INT8. Theoretical peak memory bandwidth is 163 petabytes per second.
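Those headline figures square with simple arithmetic. Here is a minimal sketch in C, assuming (these details are not in the articles themselves) the A64FX's two 512-bit SVE FMA pipes per core, i.e. 8 FP64 lanes x 2 FLOP per FMA x 2 pipes = 32 FP64 FLOP per cycle, and a full-system size of 158,976 nodes:

    #include <stdio.h>

    int main(void) {
        /* Assumed, not from the article: 32 FP64 FLOP/cycle/core
           (two 512-bit SVE FMA pipes), 158,976 nodes of 48 compute
           cores each, 2.2 GHz boost clock. */
        const double flop_per_cycle = 32.0;
        const double cores = 158976.0 * 48.0;
        const double ghz = 2.2;

        double fp64_pflops = cores * ghz * flop_per_cycle / 1.0e6;
        printf("FP64 peak: ~%.0f PFlop/s\n", fp64_pflops);                /* ~537  */
        printf("FP32 peak: ~%.2f EFlop/s\n", fp64_pflops * 2.0 / 1000.0); /* ~1.07 */
        return 0;
    }

Each halving of precision doubles the throughput, which is where the FP32, FP16, and INT8 figures above come from.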

Rmax of #10 system: 18.2 petaflops (November 2019), 21.23 petaflops (June 2020)
Rmax of #100 system: 2.57 petaflops (November 2019), 2.802 petaflops (June 2020)
Rmax of #500 system: 1.142 petaflops (November 2019), 1.23 petaflops (June 2020)

See also: Arm Aligns Its Server Ambitions To Those Of Its Partners
AMD Scores First Top 10 Zen Supercomputer... at NVIDIA

June 2020 TOP500 Supercomputer List Announced

Every six months TOP500.org announces its list of the 500 fastest supercomputers. The new TOP500 list -- its 55th -- was announced today with a brand new system at the top.

Installed at the RIKEN Center for Computational Science, the system is named Fugaku. It is composed of Fujitsu A64FX SoCs, each of which sports 48 cores at 2.2 GHz and is based on the Arm architecture. In total, it has 7,299,072 cores and attains an Rmax of 415.5 PFlop/s on the High Performance Linpack benchmark.

The previous top system is now in 2nd place. Summit is located at the Oak Ridge National Laboratory and was built by IBM. Each node has two 22-core 3.07 GHz Power9 CPUs and six NVIDIA Tesla V100 GPUs. With a total of 2,414,592 cores, it is rated at an Rmax of 148.6 PFlop/s.

Rounding out the top three is Sierra, also built by IBM. It has 22-core POWER9 CPUs running at 3.1 GHz and NVIDIA Volta GV100 GPUs. Its Rmax is 94.6 PFlop/s.

When the list was first published in June of 1993, the top system, installed at Los Alamos National Laboratory, was a CM-5/1024 by Thinking Machines Corporation. Composed of 1,024 cores, it was rated at an Rmax of 59.7 GFlop/s. (Against Fugaku's peak of roughly 513.9 PFlop/s, it would require over 8.6 million of them to match the compute power of today's number one system.) In June 1993, #100 was a Cray Y-MP8/8128 installed at Lawrence Livermore National Laboratory and rated at an Rmax of 2.1 GFlop/s. On that first list, 500th place went to an HPE C3840 having 4 cores and an Rmax of 0.4 GFlop/s. Yes, that is 400 MFlop/s.

I wonder how today's cell phones would rate against that first list.

For the curious, the benchmark code can be downloaded from http://www.netlib.org/benchmark/hpl/.


Original Submission #1 | Original Submission #2 | Original Submission #3

 
  • (Score: 2) by TheRaven (270) on Wednesday June 24 2020, @07:08PM (#1012102) Journal

    There is no way that you could get the same number of FLOPS per joule for graphics workloads on something with register rename, caches, prefetching, and so on as you could get from something optimised for this kind of use case. GPUs aren't fast and power efficient because of the things they add; they're fast and power efficient because of the things that they remove.
    --
    sudo mod me up
  • (Score: 2) by DannyB (5839) Subscriber Badge on Wednesday June 24 2020, @09:52PM (#1012170) Journal


    That makes sense.

    A GPU is the thing to use if you can arrange your problem to fit how it must be formulated to run on one.

    If you have a branch-rich set of code, this may not work well.

    As I understand it, suppose you can program your GPU kernel in C, with something like:

    if( foo ) {
        doThingOne();
    } else {
        doThingTwo();
    }

    The machine code will be something like:

    1 if( foo )
    2 doThingOne()
    3 doThingTwo()

    As all of the processing elements execute the same instructions together, some processors will execute instruction 2 and treat instruction 3 as a NOP, and some other processors will do the opposite and treat 2 as NOP and execute instruction 3. Now imagine a complex function, pulling in multiple third party libraries, and compiling that into a form that executes as SIMD on a GPU. What if my IF branch calls two alternate functions that call other functions?

    Please feel free to correct me if I'm wrong on that point.
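
    To make that concrete, here is a plain-C, lane-by-lane sketch of the model I have in mind -- a simulation of masked SIMD execution, not real GPU code, and the names are made up:

    #include <stdio.h>

    #define LANES 8   /* one SIMD group, for illustration */

    int doThingOne(int x) { return x * 2; }
    int doThingTwo(int x) { return x + 100; }

    int main(void) {
        int result[LANES];
        /* Every lane steps through BOTH sides of the branch; a per-lane
           mask decides which result is kept, so lanes on the "wrong"
           side of the if do dead work. */
        for (int lane = 0; lane < LANES; lane++) {
            int x = lane;
            int foo = (x % 2 == 0);       /* the if( foo ) condition */
            int a = doThingOne(x);        /* "instruction 2", run by all lanes */
            int b = doThingTwo(x);        /* "instruction 3", run by all lanes */
            result[lane] = foo ? a : b;   /* a select, not a jump */
        }
        for (int lane = 0; lane < LANES; lane++)
            printf("lane %d -> %d\n", lane, result[lane]);
        return 0;
    }

    Nesting deepens the problem: each level of IF multiplies the dead work, which is why a branch-rich call tree maps poorly onto SIMD.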

    Now if you have a problem that is not a large stream of uniform memory accesses, a GPU may not be suitable. Suppose the thing I want to run on one core is a fairly complex function, using libraries, big integers, and other complex functions. But I have many thousands of them to run, and each of them can be computed independently of the others. That works okay on either (1) multiple cores on a single computer, or even (2) cores spread across multiple computers on a network, iff (if and only if) the cost of the computation greatly exceeds the cost to "marshal" all of the parameters, send the request over the network for another CPU to process, and then marshal/unmarshal the response that comes back.

    Here's another problem that I'm not sure how GPUs would handle: suppose each "work item" takes a different amount of time to compute? (For example, "pixels" in the Mandelbrot set don't all take the same amount of time to complete.) So I throw many thousands of "work items" at a batch of available CPUs; as each CPU becomes free, it consumes the next work item from the queue.
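
    On conventional CPUs that pattern is straightforward. A rough pthreads sketch (the names and the fake variable-cost job are made up for illustration): an atomic counter serves as the queue, and each thread grabs the next item as soon as it finishes its last one.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define ITEMS   10000
    #define WORKERS 4

    static atomic_int next_item;   /* the "queue" is just a shared cursor */
    static int results[ITEMS];

    /* Stand-in for a variable-cost job, like one Mandelbrot pixel:
       the amount of work differs wildly from item to item. */
    static int compute(int item) {
        int acc = 0;
        for (int i = 0; i < (item % 97) * 100; i++) acc += i;
        return acc;
    }

    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            int item = atomic_fetch_add(&next_item, 1); /* take next item */
            if (item >= ITEMS) break;                   /* queue drained */
            results[item] = compute(item);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[WORKERS];
        for (int i = 0; i < WORKERS; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < WORKERS; i++) pthread_join(t[i], NULL);
        printf("done; results[42] = %d\n", results[42]);
        return 0;
    }

    Compile with -pthread. Each worker proceeds at its own pace, which is exactly what lockstep SIMD lanes cannot do.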

    I'm not saying that GPUs don't have a place. I'm saying that there are reasons running some calculations on more conventional processors appears desirable.

    I've done some looking at using GPUs (from Java, which is quite possible) for personal fun problems.

    If GPUs were suitable to take over all computation, they would have already.

    --
    People today are educated enough to repeat what they are taught but not to question what they are taught.
    • (Score: 2) by TheRaven (270) on Thursday June 25 2020, @12:25PM (#1012365) Journal


      "As all of the processing elements execute the same instructions together, some processors will execute instruction 2 and treat instruction 3 as a NOP, and some other processors will do the opposite and treat 2 as NOP and execute instruction 3"

      More or less. There's some variation, but typically you'll actually end up with both being executed and then a select on the result. For small amounts of divergent flow control, that's more efficient because it avoids the cost of the machinery that you need to keep a pipelined CPU fed with data.

      Some modern GPU designs are a bit more complex and are closer to SMT processors. They deal with predictable streaming access patterns, so they'll have a large prefetch buffer for each hardware context and execute instructions round-robin from whichever contexts are ready.

      "If GPUs were suitable to take over all computation, they would have already."

      That's exactly what I originally said. Neither GPUs nor CPUs are general-purpose processors. They are both aggressively tuned for specific classes of workloads. If you want something that is as fast for both kinds of workload, you will end up with something that consumes more power than the sum of the CPU and GPU. If you have a category of workload for which neither is an ideal fit, you want something different. This is why SoCs exist and have dozens of specialised cores, and why the cores in TFA have very wide vector units (HPC workloads have some of the characteristics that GPUs are optimised for and some of the characteristics that CPUs are optimised for; the best processors for HPC have some of the characteristics of both).

      --
      sudo mod me up