
posted by takyon on Tuesday June 23 2020, @04:00PM
from the exascale-arm-wrestling dept.

New #1 Supercomputer: Fujitsu's Fugaku

High performance computing has reached the point where claiming the number one spot takes very powerful, very efficient hardware, lots of it, and the capability to deploy it. Deploying a single rack of servers totaling a couple of thousand cores isn't going to cut it. The former #1 supercomputer, Summit, is built from 22-core IBM Power9 CPUs paired with NVIDIA GV100 accelerators, totaling 2.4 million cores and consuming 10 megawatts of power. The new Fugaku supercomputer, built at RIKEN in partnership with Fujitsu, takes the top spot on the June 2020 TOP500 list, with 7.3 million cores and consuming 28 megawatts of power.

The new Fugaku supercomputer is bigger than Summit in practically every way. It has 3.05x the cores, 2.8x the score in the official LINPACK tests, and consumes 2.8x the power. It also marks the first time that an Arm-based system sits at number one on the TOP500 list.

Also at NYT.

Fujitsu Fugaku report by Jack Dongarra (3.3 MB PDF)

The Fujitsu A64FX is a 64-bit ARM CPU with 48 compute cores plus 2-4 assistant cores for the operating system. It uses 32 GiB of on-package High Bandwidth Memory 2 (HBM2). There are no GPUs or accelerators used in the Fugaku supercomputer.

Fugaku can reach as high as 537 petaflops of FP64 (boost mode), or 1.07 exaflops of FP32, 2.15 exaflops of FP16, and 4.3 exaOPS of INT8. Theoretical peak memory bandwidth is 163 petabytes per second.
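
Those headline figures can be roughly reconstructed from the per-node specs. Here is a back-of-the-envelope sketch in Python; the 158,976-node count, the two 512-bit SVE FMA pipes per core, and the 1,024 GB/s of HBM2 bandwidth per node are published A64FX/Fugaku figures assumed for the arithmetic, not numbers taken from the summary above.

# Back-of-the-envelope check of the headline numbers (assumed per-node specs).
NODES = 158_976
CORES_PER_NODE = 48            # compute cores only; assistant cores excluded
BOOST_GHZ = 2.2
FP64_PER_CYCLE = 32            # 2 FMA pipes x 8 doubles x 2 ops (mul + add)
HBM2_GBPS = 1024               # per node, 4 stacks at 256 GB/s each (assumed)

node_fp64 = CORES_PER_NODE * BOOST_GHZ * 1e9 * FP64_PER_CYCLE
print(f"FP64 per node : {node_fp64 / 1e12:.2f} TFLOPS")              # ~3.38
print(f"FP64 machine  : {node_fp64 * NODES / 1e15:.0f} PFLOPS")      # ~537
print(f"FP16 machine  : {node_fp64 * 4 * NODES / 1e18:.2f} EFLOPS")  # ~2.15
print(f"Memory BW     : {HBM2_GBPS * NODES / 1e6:.0f} PB/s")         # ~163

(The June 2020 HPL run used 152,064 of those nodes, which is where the 7,299,072-core figure below comes from.)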

RMAX of #10 system: 18.2 petaflops (November 2019), 21.23 petaflops (June 2020)
RMAX of #100 system: 2.57 petaflops (November 2019), 2.802 petaflops (June 2020)
RMAX of #500 system: 1.142 petaflops (November 2019), 1.23 petaflops (June 2020)

See also: Arm Aligns Its Server Ambitions To Those Of Its Partners
AMD Scores First Top 10 Zen Supercomputer... at NVIDIA

June 2020 TOP 500 Supercomputer List Announced

Every six months TOP500.org announces its list of the top 500 fastest supercomputers. The new TOP500 list -- their 55th -- was announced today with a brand new system at the top.

Installed at the RIKEN Center for Computational Science, the system is named Fugaku. It is composed of Fujitsu A64FX SoCs, each of which sports 48 cores at 2.2 GHz and is based on the ARM architecture. In total, it has 7,299,072 cores and attains an Rmax of 415.5 (PFlop/s) on the High Performance Linpack benchmark.

The previous top system is now in 2nd place. The Summit is located at the Oak Ridge National Laboratory and was built by IBM. Each node has two 22-core 3.07 GHz Power9 CPUs and six NVIDIA Tesla V100 GPUs. With a total of 2,414,592 cores, it is rated at an Rmax of 148.6 (PFlop/s).

Rounding out the top 3 is Sierra, also built by IBM. It has 22-core POWER9 CPUs running at 3.1 GHz and NVIDIA Volta GV100 GPUs. Its Rmax is 94.6 (PFlop/s).

When the list was first published in June of 1993, the top system on the list, installed at Los Alamos National Laboratory, was a CM-5/1024 by Thinking Machines Corporation. With 1,024 cores, it was rated at an Rmax of 59.7 (GFlop/s). (It would take nearly 7 million of them to match the Rmax of today's number one system.) In June 1993, #100 was a Cray Y-MP8/8128 installed at Lawrence Livermore National Laboratory and rated at an Rmax of 2.1 (GFlop/s). On that first list, 500th place went to an HPE C3840 having 4 cores and an Rmax of 0.4 (GFlop/s). Yes, that is 400 MFlop/s.
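
Those comparisons fall straight out of the Rmax figures; a quick Python sketch using only numbers already quoted in this summary:

fugaku_rmax_gflops = 415.5e6      # 415.5 PFlop/s, June 2020 #1
cm5_rmax_gflops = 59.7            # CM-5/1024, June 1993 #1
last_1993_gflops = 0.4            # #500 in June 1993

print(f"CM-5/1024 systems to match Fugaku's Rmax: "
      f"{fugaku_rmax_gflops / cm5_rmax_gflops:,.0f}")              # ~6,959,799
print(f"June 1993 #500 Rmax: {last_1993_gflops * 1000:.0f} MFlop/s")  # 400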

I wonder how today's cell phones would rate against that first list?

For the curious, the benchmark code can be downloaded from http://www.netlib.org/benchmark/hpl/.


Original Submission #1 | Original Submission #2 | Original Submission #3

Related Stories

TOP500 Supercomputers in November 2020: Germany at #7, Saudi Arabia at #10

TOP500 Expands Exaflops Capacity Amidst Low Turnover

The entry level to the list moved up to 1.32 petaflops on the High Performance Linpack (HPL) benchmark, a small increase from 1.23 petaflops recorded in the June 2020 rankings. In a similar vein, the aggregate performance of all 500 systems grew from 2.22 exaflops in June to just 2.43 exaflops on the latest list. Likewise, average concurrency per system barely increased at all, growing from 145,363 cores six months ago to 145,465 cores in the current list.

There were, however, a few notable developments in the top 10, including two new systems, as well as a new high-water mark set by the top-ranked Fugaku supercomputer. Thanks to additional hardware, Fugaku grew its HPL performance to 442 petaflops, a modest increase from the 416 petaflops the system achieved when it debuted in June 2020. More significantly, Fugaku increased its performance on the new mixed precision HPL-AI benchmark to 2.0 exaflops, besting its 1.4 exaflops mark recorded six months ago. These represent the first benchmark measurements above one exaflop for any precision on any type of hardware.

[...] At number five is Selene, an NVIDIA DGX A100 SuperPOD installed in-house at NVIDIA Corp. It was listed as number seven in June but has doubled in size, allowing it to move up the list by two positions. The system is based on AMD EPYC processors with NVIDIA's new A100 GPUs for acceleration. Selene achieved 63.4 petaflops on HPL as a result of the upgrade.

[...] A new supercomputer, known as the JUWELS Booster Module, debuts at number seven on the list. The Atos-built BullSequana machine was recently installed at the Forschungszentrum Jülich (FZJ) in Germany. It is part of a modular system architecture, and a second, Xeon-based JUWELS Module is listed separately on the TOP500 at position 44. These modules are integrated using the ParTec Modulo Cluster Software Suite. The Booster Module uses AMD EPYC processors with NVIDIA A100 GPUs for acceleration, similar to the number five Selene system. Running by itself, the JUWELS Booster Module was able to achieve 44.1 HPL petaflops, which makes it the most powerful system in Europe.

[...] The second new system at the top of the list is Dammam-7, which is ranked 10th. It is installed at Saudi Aramco in Saudi Arabia and is the second commercial supercomputer in the current top 10. The HPE Cray CS-Storm system uses Intel Xeon Gold CPUs and NVIDIA Tesla V100 GPUs. It reached 22.4 petaflops on the HPL benchmark.

The Green500 list is led by a smaller NVIDIA DGX SuperPOD system at 26.2 gigaflops/Watt (ranked #171 on the TOP500).

#1 system: 415.5 petaflops Rmax (June 2020), 442 petaflops (November 2020)
#10 system: 21.2 petaflops (June), 22.4 petaflops (Nov)
#100 system: 2.8 petaflops (June), 3.15 petaflops (Nov)
#500 system: 1.23 petaflops (June), 1.32 petaflops (Nov)

Previously: Fujitsu's ARM-based Fugaku Supercomputer Leads June 2020 TOP500 List at 415 PetaFLOPs


Original Submission

This discussion has been archived. No new comments can be posted.
  • (Score: 3, Informative) by Mojibake Tengu on Tuesday June 23 2020, @04:20PM

    by Mojibake Tengu (8598) on Tuesday June 23 2020, @04:20PM (#1011612) Journal
    --
    Respect Authorities. Know your social status. Woke responsibly.
  • (Score: 2) by KilroySmith on Tuesday June 23 2020, @04:41PM

    by KilroySmith (2113) on Tuesday June 23 2020, @04:41PM (#1011627)
  • (Score: 2) by KilroySmith on Tuesday June 23 2020, @04:56PM (5 children)

    by KilroySmith (2113) on Tuesday June 23 2020, @04:56PM (#1011635)

    In 1993, #1 was 60 GFlop/s / 1024 cores = 0.06 GFlop/s per core
    In 2020, #1 is 148,600,000 GFlop/s / 2,414,592 cores = 60 GFlop/s per core.
    So, roughly 1000 times faster per core. A bit less improvement than I would have expected.

    • (Score: 2) by takyon on Tuesday June 23 2020, @05:04PM (4 children)

      by takyon (881) <reversethis-{gro ... s} {ta} {noykat}> on Tuesday June 23 2020, @05:04PM (#1011638) Journal

      2,414,592 seems to be counting CPU and GPU cores. A single GPU core will be much weaker.

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 2) by KilroySmith on Tuesday June 23 2020, @05:22PM (2 children)

        by KilroySmith (2113) on Tuesday June 23 2020, @05:22PM (#1011650)

        Hmm, their report says 152,064 nodes in their system, with each node having 48 ARM V8 compute cores + 4 "assistant" cores for the OS. That's a total of 7,299,072 compute cores, which would seem to be only CPU cores - there don't appear to be any "GPU" cores in the system at all. They seem to have integrated support for AI-specific datatypes (FP16, etc) into the ARM core.

        • (Score: 2) by takyon on Tuesday June 23 2020, @05:27PM (1 child)

          by takyon (881) <reversethis-{gro ... s} {ta} {noykat}> on Tuesday June 23 2020, @05:27PM (#1011655) Journal

          You picked out the core count for Summit, which uses POWER9 CPUs and Nvidia GPUs.

          The previous top system is now in 2nd place. The Summit is located at the Oak Ridge National Laboratory and was built by IBM. Each node has two 22-core 3.07 GHz Power9 CPUs and six NVIDIA Tesla V100 GPUs. With a total of 2,414,592 cores, it is rated at an Rmax of 148.6 (PFlop/s).

          The new #1 uses purely A64FX CPUs.

          Blame martyb.

          --
          [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 2) by KilroySmith on Tuesday June 23 2020, @05:27PM

        by KilroySmith (2113) on Tuesday June 23 2020, @05:27PM (#1011654)

        Ooops, just realized that I used the number of cores from the #2 machine.
        For today,
        415,500,000 GFlops/sec / 7,299,072 cores = 57 GFlops/sec per core.

        So, no significant change to the conclusion. Whew.

  • (Score: 2, Funny) by DannyB on Tuesday June 23 2020, @05:24PM (2 children)

    by DannyB (5839) Subscriber Badge on Tuesday June 23 2020, @05:24PM (#1011651) Journal

    Imagine a Beowulf cluster of these!

    --
    People today are educated enough to repeat what they are taught but not to question what they are taught.
    • (Score: 0) by Anonymous Coward on Tuesday June 23 2020, @10:48PM

      by Anonymous Coward on Tuesday June 23 2020, @10:48PM (#1011759)

I came here just to see this comment.

    • (Score: 1) by petecox on Wednesday June 24 2020, @02:26AM

      by petecox (3228) on Wednesday June 24 2020, @02:26AM (#1011818)

      aarch64 Hackintosh.

  • (Score: 2) by DannyB on Tuesday June 23 2020, @05:25PM (6 children)

    by DannyB (5839) Subscriber Badge on Tuesday June 23 2020, @05:25PM (#1011652) Journal

    I wish the top 500 list showed which OS was in use on each supercomputer.

    I tried to download an Excel spreadsheet of the top 500, but they wanted me to create a login.

    --
    People today are educated enough to repeat what they are taught but not to question what they are taught.
    • (Score: 3, Informative) by takyon on Tuesday June 23 2020, @05:29PM (4 children)

      by takyon (881) <reversethis-{gro ... s} {ta} {noykat}> on Tuesday June 23 2020, @05:29PM (#1011656) Journal

      Click the individual supercomputer name.

      https://www.top500.org/lists/top500/2020/06/ [top500.org]

      https://www.top500.org/system/179807/ [top500.org]

      Red Hat Enterprise Linux
      FUJITSU Software Technical Computing Suite V4.0

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 2) by DannyB on Tuesday June 23 2020, @05:35PM (1 child)

        by DannyB (5839) Subscriber Badge on Tuesday June 23 2020, @05:35PM (#1011660) Journal

        Useful, but too bad I can't easily see how many of these monsters are running the wonderful Windows operating system.

        --
        People today are educated enough to repeat what they are taught but not to question what they are taught.
      • (Score: 0) by Anonymous Coward on Tuesday June 23 2020, @07:01PM (1 child)

        by Anonymous Coward on Tuesday June 23 2020, @07:01PM (#1011683)

systemd keeps everything load balanced. amirite amirite?

        • (Score: 2) by DannyB on Tuesday June 23 2020, @07:31PM

          by DannyB (5839) Subscriber Badge on Tuesday June 23 2020, @07:31PM (#1011690) Journal

          With such gigantic clusters of processors, just imagine the amazing wonderfulness of a gigantic fuster cluck of systemd!

          It would be mined boggling to blow one's mined.

          --
          People today are educated enough to repeat what they are taught but not to question what they are taught.
    • (Score: 2) by KilroySmith on Tuesday June 23 2020, @05:31PM

      by KilroySmith (2113) on Tuesday June 23 2020, @05:31PM (#1011657)

      Well, you could click the "Fujitsu Fugaku report by Jack Dongarra" link provided in the summary, and read that:
      "The operating system is RedHat Enterprise Linux 8 and McKernel (light-weight multi kernel operating system)."

      I'm guessing that they're running Linux on the 4 "assistant" cores on each chip, with McKernel running on each of the 48 compute cores.

  • (Score: 3, Insightful) by DannyB on Tuesday June 23 2020, @05:33PM (9 children)

    by DannyB (5839) Subscriber Badge on Tuesday June 23 2020, @05:33PM (#1011659) Journal

    I would not (yet) have expected ARM CPUs in the top 500 list. So ARM on supercomputers is reality and no longer imagination.

    There are no GPUs or accelerators used in the Fugaku supercomputer.

    Also interesting to not have any GPUs. I dream of desktops having lots of general purpose CPU cores, so many that GPUs become unnecessary. Don't say it's crazy to dream. In the 1975 Altair 8800 era, the compute power of today's disposable Raspberry Pi would have seemed a ridiculous daydream.

    --
    People today are educated enough to repeat what they are taught but not to question what they are taught.
    • (Score: 3, Informative) by takyon on Tuesday June 23 2020, @05:58PM (3 children)

      by takyon (881) <reversethis-{gro ... s} {ta} {noykat}> on Tuesday June 23 2020, @05:58PM (#1011668) Journal

      ARM has ramped up its presence quickly:

      https://www.arm.com/company/news/2020/06/powering-the-fastest-supercomputer [arm.com]

      Go to "Processor Generation":

      https://www.top500.org/statistics/list/ [top500.org]

      Only a handful of ARM supercomputers: Fujitsu A64FX in 3 systems, Marvell ThunderX2 in 1, and I think that's it. But... if you sort by RMax, ARM kind of leads with 424 petaflops total (in reality, Intel Xeon's huge share is split across several different product generations in that list).

      x86 will make a comeback [anandtech.com], with a 1 EFLOPS Intel "Aurora" and 1.5 EFLOPS AMD "Frontier" systems next year, and 2 EFLOPS AMD "El Capitan" in 2023.

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 2) by DannyB on Tuesday June 23 2020, @07:38PM (2 children)

        by DannyB (5839) Subscriber Badge on Tuesday June 23 2020, @07:38PM (#1011692) Journal

        ARM has ramped up its presence quickly

        ARM's presents on supercomputers is welcome.

        x86 will make a comeback

        That article is about AMD success (yea!), not Intel. Not that I'm sad about that.

        Now let's see . . .

        Google has ARM chromebooks (and amd64 chromebooks).

        Apple transitioning to ARM.

        ARM in everything mobile. And some embedded.

        Microsoft might try, try again to transition to ARM. (If at first you don't succeed, use a shorter bungee!) But Microsoft's problem is that the only real value of Windows is the legacy software base.

        --
        People today are educated enough to repeat what they are taught but not to question what they are taught.
    • (Score: 3, Insightful) by TheRaven on Wednesday June 24 2020, @04:14PM (4 children)

      by TheRaven (270) on Wednesday June 24 2020, @04:14PM (#1012026) Journal

      I would not (yet) have expected ARM CPUs in the top 500 list

      ARM's SVE ISA was (as I understand it) largely developed by Fujitsu specifically for supercomputers. Fujitsu has a lot of experience building supercomputers, with their own SPARC64 chips. They wanted something that had large vector support but also was optimised for the same kinds of workloads that CPUs are good for (branchy, locality of data). They could have added this to SPARC, but they are basically paying for all of the toolchain work on SPARC and no one else is optimising for it. ARM, for them, lets them focus on the bits that they really know well (high-performance 512-bit vector units) and share the costs of developing the software ecosystem with others.

      Also interesting to not have any GPUs. I dream of desktops having lots of general purpose cpu cores, so many that gpu's become unnecessary

      GPUs will be necessary for power, if not for performance. CPUs are not general-purpose processors. They are processors optimised for instruction sequences that have a lot of branches (on average, around one every 7 instructions), primarily doing integer compute, and with random memory access patterns that exhibit strong locality of reference. In contrast, GPUs are optimised for instruction streams that have few branches, lots of floating point operations, and regular streaming memory access patterns. Any processor that is optimised for both of these will be less efficient for either. A GPU, for example, doesn't do register renaming or speculative execution and so saves power from not needing these.

      --
      sudo mod me up
      • (Score: 2) by DannyB on Wednesday June 24 2020, @04:19PM (3 children)

        by DannyB (5839) Subscriber Badge on Wednesday June 24 2020, @04:19PM (#1012034) Journal

        GPUs will be necessary for power, if not for performance. CPUs are not general-purpose processors. They are processors optimised for instruction sequences that have a lot of branches (on average, around one every 7 instructions), primarily doing integer compute, and with random memory access patterns that exhibit strong locality of reference. In contrast, GPUs are optimised for instruction streams that have few branches, lots of floating point operations, and regular streaming memory access patterns. Any processor that is optimised for both of these will be less efficient for either. A GPU, for example, doesn't do register renaming or speculative execution and so saves power from not needing these.

        An excellent explanation of the use case of having lots of CPU cores vs a GPU.

        --
        People today are educated enough to repeat what they are taught but not to question what they are taught.
        • (Score: 2) by TheRaven on Wednesday June 24 2020, @07:08PM (2 children)

          by TheRaven (270) on Wednesday June 24 2020, @07:08PM (#1012102) Journal
          There is no way that you could get the same number of FLOPS per Joule on something with register rename, caches, prefetching, and so on for graphics workloads as you could get from something optimised for this kind of use case. GPUs aren't fast and power efficient because of the things they add, they're fast and power efficient because of the things that they remove.
          --
          sudo mod me up
          • (Score: 2) by DannyB on Wednesday June 24 2020, @09:52PM (1 child)

            by DannyB (5839) Subscriber Badge on Wednesday June 24 2020, @09:52PM (#1012170) Journal

            That makes sense.

            That is the thing to use if you can arrange your problem around how it must be formulated to run on a GPU.

            If you have a branch rich set of code, this may not work well.

            As I understand it, suppose you can program your GPU kernel in C, with something like:

            if (foo) {
                doThingOne();
            } else {
                doThingTwo();
            }

            The machine code will be something like:

            1 if( foo )
            2 doThingOne
            3 doThingTwo

            As all of the processing elements execute the same instructions together, some processors will execute instruction 2 and treat instruction 3 as a NOP, and some other processors will do the opposite and treat 2 as NOP and execute instruction 3. Now imagine a complex function, pulling in multiple third party libraries, and compiling that into a form that executes as SIMD on a GPU. What if my IF branch calls two alternate functions that call other functions?

            Please feel free to correct me if I'm wrong on that point.

            Now if you have a problem that is not a large stream of memory accesses, this may not be suitable. Suppose the thing I want to run on one core is a fairly complex function, using libraries, big integers, complex functions. But I have many thousands of them to run. And each of them can be computed independently of one another. That works okay on either (1) multiple cores on a single computer, or even (2) cores spread across multiple computers on a network, IIF, the cost of the computation greatly exceeds the cost to "marshal" all of the parameters to send the request over the network for another cpu to process, and then to marshal/unmarshal the response that comes back.

            Here's another problem that I'm not sure how GPUs would handle. Suppose each "work item" might take a different amount of time to compute? (Example: "pixels" in the Mandelbrot set -- they don't all take the same amount of time to complete.) On CPUs I can throw many thousands of "work items" at a batch of available cores; whenever a CPU becomes available, it consumes the next work item from the queue.
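
            Something like this is what I have in mind on the CPU side -- a rough Python sketch; the escape-time kernel and every name in it are purely illustrative, not anything from Fugaku:

            from concurrent.futures import ProcessPoolExecutor

            def escape_time(c, max_iter=10_000):
                # Mandelbrot-style kernel: cost varies wildly from one input to the next.
                z = 0j
                for n in range(max_iter):
                    z = z * z + c
                    if abs(z) > 2.0:
                        return n
                return max_iter

            if __name__ == "__main__":
                # Thousands of independent work items; chunksize=1 keeps the queue
                # fine-grained, so a free worker always grabs the next single item
                # instead of sitting behind a batch of slow ones.
                points = [complex(-2.0 + 2.5 * k / 5000, 0.5) for k in range(5000)]
                with ProcessPoolExecutor() as pool:
                    results = list(pool.map(escape_time, points, chunksize=1))
                print(sum(results))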

            I'm not saying that GPUs don't have a place. I'm saying that there are reasons running some calculations on more conventional processors appears desirable.

            I've done some looking at using GPUs (from Java, which is quite possible). For personal fun problems.

            If GPUs were suitable to take over all computation, they would have already.

            --
            People today are educated enough to repeat what they are taught but not to question what they are taught.
            • (Score: 2) by TheRaven on Thursday June 25 2020, @12:25PM

              by TheRaven (270) on Thursday June 25 2020, @12:25PM (#1012365) Journal

              As all of the processing elements execute the same instructions together, some processors will execute instruction 2 and treat instruction 3 as a NOP, and some other processors will do the opposite and treat 2 as NOP and execute instruction 3

              More or less. There's some variation, but typically you'll actually end up with both being executed and then a select on the result. For small amounts of divergent flow control, that's more efficient, because it avoids the cost of the machinery that a pipelined CPU needs to handle branches and keep itself fed with data.
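
              In data-parallel form, the divergent if/else from upthread becomes something like the following -- a NumPy sketch purely to illustrate the idea, not actual GPU machine code:

              import numpy as np

              x = np.linspace(-4.0, 4.0, 16)
              foo = x > 0                      # one flag per lane: the "if (foo)" condition

              thing_one = np.sqrt(np.abs(x))   # every lane evaluates *both* branch bodies,
              thing_two = x * x                # whatever its flag says...

              result = np.where(foo, thing_one, thing_two)   # ...then a per-lane select
              print(result)                                  # keeps the wanted value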

              Some modern GPU designs are a bit more complex and are closer to SMT processors. They deal with predictable streaming access patterns, so they'll have a large prefetch buffer for each hardware context and execute instructions round-robin from the contexts whose data is ready.

              If GPUs were suitable to take over all computation, they would have already.

              That's exactly what I originally said. Neither GPUs nor CPUs are general-purpose processors. They are both aggressively tuned for specific classes of workloads. If you want something that is as fast for both kinds of workload, you will end up with something that consumes more power than the sum of the CPU and GPU. If you have a category of workload for which neither is an ideal fit, you want something different. This is why SoCs exist and have dozens of specialised cores and why the cores in TFA have very wide vector units (HPC workloads have some of the characteristics that GPUs are optimised for and some of the characteristics that CPUs are optimised for, the best processors for HPC have some of the characteristics of both).

              --
              sudo mod me up
  • (Score: 0) by Anonymous Coward on Tuesday June 23 2020, @09:32PM (1 child)

    by Anonymous Coward on Tuesday June 23 2020, @09:32PM (#1011722)

    didn't like one of the first intel CPUs (8086?) have about "7,299,072" TRANSISTORS?

    • (Score: 2) by TheRaven on Wednesday June 24 2020, @04:16PM

      by TheRaven (270) on Wednesday June 24 2020, @04:16PM (#1012029) Journal
      Much fewer for 8086. The ARM1 processor had about 25,000 transistors, the 80386 had about 275,000. 7 million is around the number for the Pentium Pro or the Pentium II.
      --
      sudo mod me up
  • (Score: 2) by bart on Tuesday June 23 2020, @09:42PM

    by bart (2844) on Tuesday June 23 2020, @09:42PM (#1011732)

    You better stay out of engineering!

  • (Score: -1, Troll) by Anonymous Coward on Wednesday June 24 2020, @01:30AM

    by Anonymous Coward on Wednesday June 24 2020, @01:30AM (#1011798)

    As a self-identified Black businesswoman, I find it appalling that Japan is top of this list when the government could start a program to subsidize Black-owned companies to produce a faster computer in the USA.
