posted by martyb on Wednesday July 11 2018, @06:14PM
from the where-many-flops-is-a-GOOD-thing dept.

In an interview posted just before the release of the latest TOP500 list, high performance computing expert Dr. Thomas Sterling (one of the two builders of the original "Beowulf cluster") had this to say about the possibility of reaching "zettascale" (beyond 1,000 exaflops):

I'll close here by mentioning two other possibilities that, while not widely considered currently, are nonetheless worthy of research. The first is superconducting supercomputing and the second is non-von Neumann architectures. Interestingly, the two at least in some forms can serve each other making both viable and highly competitive with respect to future post-exascale computing designs. Niobium Josephson Junction-based technologies cooled to four Kelvins can operate beyond 100 and 200 GHz and has slowly evolved over two or more decades. When once such cold temperatures were considered a show stopper, now quantum computing – or at least quantum annealing – typically is performed at 40 milli-Kelvins or lower, where four Kelvins would appear like a balmy day on the beach. But latencies measured in cycles grow proportionally with clock rate and superconducting supercomputing must take a very distinct form from typical von Neumann cores; this is a controversial view, by the way.

Possible alternative non-von Neumann architectures that would address this challenge are cellular automata and data flow, both with their own problems, of course – nothing is easy. I introduce this thought not to necessarily advocate for a pet project – it is a pet project of mine – but to suggest that the view of the future possibilities as we enter the post-exascale era is a wide and exciting field at a time where we may cross a singularity before relaxing once again on a path of incremental optimizations.

I once said in public and in writing that I predicted we would never get to zettaflops computing. Here, I retract this prediction and contribute a contradicting assertion: zettaflops can be achieved in less than 10 years if we adopt innovations in non-von Neumann architecture. With a change to cryogenic technologies, we can reach yottaflops by 2030.
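
Sterling's remark that latencies measured in cycles grow in proportion to clock rate is easy to put in numbers. Here is a rough back-of-the-envelope sketch in Python; the 100 ns memory latency is an assumed, illustrative figure, not one taken from the interview:

    # A fixed physical latency costs many more *cycles* at higher clock rates.
    # The 100 ns round-trip memory latency below is an illustrative assumption.

    memory_latency_s = 100e-9  # assumed latency to main memory (100 ns)

    for clock_hz in (3e9, 100e9, 200e9):  # conventional core vs. superconducting logic
        stalled_cycles = memory_latency_s * clock_hz
        print(f"{clock_hz / 1e9:6.0f} GHz clock -> ~{stalled_cycles:,.0f} cycles spent waiting on one access")

At a few GHz a memory access costs a few hundred cycles; at 100-200 GHz the same physical wait becomes tens of thousands of cycles, which is the argument for architectures such as cellular automata or dataflow that keep data next to the logic instead of stalling a conventional von Neumann core.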

The rest of the interview covers a number of interesting topics, such as China's increased presence on the supercomputing list.

Also at NextBigFuture.

Previously: Thomas Sterling: 'I Think We Will Never Reach Zettaflops' (2012)

Related: IBM Reduces Neural Network Energy Consumption Using Analog Memory and Non-Von Neumann Architecture
IEEE Releases the International Roadmap for Devices and Systems (IRDS)
June 2018 TOP500 List: U.S. Claims #1 and #3 Spots


Original Submission

Related Stories

IBM Reduces Neural Network Energy Consumption Using Analog Memory and Non-Von Neumann Architecture 38 comments

IBM researchers use analog memory to train deep neural networks faster and more efficiently

Deep neural networks normally require fast, powerful graphics processing unit (GPU) hardware accelerators to support the needed high speed and computational accuracy — such as the GPU devices used in the just-announced Summit supercomputer. But GPUs are highly energy-intensive, making their use expensive and limiting their future growth, the researchers explain in a recent paper published in Nature.

Instead, the IBM researchers used large arrays of non-volatile analog memory devices (which use continuously variable signals rather than binary 0s and 1s) to perform computations. Those arrays allowed the researchers to create, in hardware, the same scale and precision of AI calculations that are achieved by more energy-intensive systems in software, but running hundreds of times faster and at hundreds of times lower power — without sacrificing the ability to create deep learning systems.

The trick was to replace conventional von Neumann architecture, which is "constrained by the time and energy spent moving data back and forth between the memory and the processor (the 'von Neumann bottleneck')," the researchers explain in the paper. "By contrast, in a non-von Neumann scheme, computing is done at the location of the data [in memory], with the strengths of the synaptic connections (the 'weights') stored and adjusted directly in memory."
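
The organization being described is easy to sketch: the weight array is the memory, and the multiply-accumulate happens where the weights live, so no weight traffic crosses a memory bus. The toy NumPy model below only emulates that idea and is not the researchers' analog hardware; every name in it is hypothetical:

    import numpy as np

    # Toy sketch of an in-memory ("non-von Neumann") layer: the weight array
    # *is* the memory, and the multiply-accumulate happens where it is stored.
    # In an analog crossbar the weights would be device conductances and the
    # sums would form as currents on shared wires; here that is only emulated.

    rng = np.random.default_rng(0)

    class CrossbarLayer:
        def __init__(self, n_in, n_out):
            # Weights live "in place"; training would adjust them directly here
            # instead of streaming them to and from a separate processor.
            self.weights = rng.normal(scale=0.1, size=(n_in, n_out))

        def forward(self, x):
            # One parallel step: inputs act like row voltages, weights like
            # conductances, and each column sums its products.
            return x @ self.weights

    layer = CrossbarLayer(n_in=784, n_out=128)
    out = layer.forward(rng.normal(size=784))
    print(out.shape)  # (128,)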

Equivalent-accuracy accelerated neural-network training using analogue memory (DOI: 10.1038/s41586-018-0180-5) (DX)


Original Submission

IEEE Releases the International Roadmap for Devices and Systems (IRDS) 9 comments

Submitted via IRC for takyon

IEEE, the world's largest technical professional organization dedicated to advancing technology for humanity, today announced the release of the 2017 edition of the International Roadmap for Devices and Systems (IRDS), building upon 15 years of projecting technology needs for evolving the semiconductor and computer industries. The IRDS is an IEEE Standards Association (IEEE-SA) Industry Connections (IC) Program sponsored by the IEEE Rebooting Computing (IEEE RC) Initiative, which has taken a lead in building a comprehensive view of the devices, components, systems, architecture, and software that comprise the global computing ecosystem.

According to Paolo A. Gargini, IEEE and Japan Society of Applied Physics (JSAP) Fellow, and Chairman of IRDS, "Over the past decade the structure and requirements of the electronics industry have evolved well beyond the semiconductor's industry requirements. In line with the changes in the new electronics ecosystem, the 2017 IRDS has integrated system requirements with device requirements and identified some new powerful solutions that will support and revolutionize the electronics industry for the next 15 years."

According to William Tonti, IEEE Fellow and IEEE Future Directions Sr. Director, "The IRDS presents an end-to-end continuum of computing as requirements evolve into multiple platforms."

Source: https://www.hpcwire.com/off-the-wire/ieee-releases-the-international-roadmap-for-devices-and-systems-irds/


Original Submission

June 2018 TOP500 List: U.S. Claims #1 and #3 Spots 27 comments

The U.S. leads the June 2018 TOP500 list with a 122.3 petaflops system:

The TOP500 celebrates its 25th anniversary with a major shakeup at the top of the list. For the first time since November 2012, the US claims the most powerful supercomputer in the world, leading a significant turnover in which four of the five top systems were either new or substantially upgraded.

Summit, an IBM-built supercomputer now running at the Department of Energy's (DOE) Oak Ridge National Laboratory (ORNL), captured the number one spot with a performance of 122.3 petaflops on High Performance Linpack (HPL), the benchmark used to rank the TOP500 list. Summit has 4,356 nodes, each one equipped with two 22-core Power9 CPUs, and six NVIDIA Tesla V100 GPUs. The nodes are linked together with a Mellanox dual-rail EDR InfiniBand network.

[...] Sierra, a new system at the DOE's Lawrence Livermore National Laboratory took the number three spot, delivering 71.6 petaflops on HPL. Built by IBM, Sierra's architecture is quite similar to that of Summit, with each of its 4,320 nodes powered by two Power9 CPUs plus four NVIDIA Tesla V100 GPUs and using the same Mellanox EDR InfiniBand as the system interconnect.

The #100 system has an Rmax of 1.703 petaflops, up from 1.283 petaflops in November. The #500 system has an Rmax of 715.6 teraflops, up from 548.7 teraflops in November.

273 systems have a performance of at least 1 petaflops, up from 181 systems. The combined performance of the top 500 systems is 1.22 exaflops, up from 845 petaflops.

On the Green500 list, Shoubu system B's efficiency has been adjusted to 18.404 gigaflops per Watt from 17.009 GFLOPS/W. The Summit supercomputer, #1 on TOP500, debuts at #5 on the Green500 with 13.889 GFLOPS/W. Japan's AI Bridging Cloud Infrastructure (ABCI) supercomputer, #5 on TOP500 (19.88 petaflops Rmax), is #8 on the Green500 with 12.054 GFLOPS/W.
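
Those Green500 figures also pin down roughly how much power each machine draws during its HPL run, since power is just Rmax divided by efficiency. A quick sketch using only the numbers quoted above (approximate; the official lists report measured power separately):

    # Implied HPL power draw: power = Rmax / efficiency.
    # Rough cross-check only, using the figures quoted in the summary.

    systems = {
        "Summit": (122.3e6, 13.889),  # (Rmax in gigaflops, gigaflops per watt)
        "ABCI":   (19.88e6, 12.054),
    }

    for name, (rmax_gflops, gflops_per_watt) in systems.items():
        megawatts = rmax_gflops / gflops_per_watt / 1e6
        print(f"{name}: ~{megawatts:.1f} MW during the HPL run")

Summit comes out at roughly 8.8 MW and ABCI at roughly 1.6 MW, which is why every zettascale projection quickly turns into a power-budget discussion.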

Previously: TOP500 List #50 and Green500 List #21: November 2017


Original Submission

Getting to Zettascale Without Needing Multiple Nuclear Power Plants 7 comments

Getting To Zettascale Without Needing Multiple Nuclear Power Plants:

There's no resting on your laurels in the HPC world, no time to sit back and bask in a hard-won accomplishment that was years in the making. The ticker tape has only now been swept up in the wake of the long-awaited celebration last year of finally reaching the exascale computing level, with the Frontier supercomputer housed at the Oak Ridge National Labs breaking that barrier.

With that in the rear-view mirror, attention is turning to the next challenge: Zettascale computing, some 1,000 times faster than what Frontier is running. In the heady months after his heralded 2021 return to Intel as CEO, Pat Gelsinger made headlines by saying the giant chip maker was looking at 2027 to reach zettascale.

Lisa Su, the chief executive officer who has led the remarkable turnaround at Intel's chief rival AMD, took the stage at ISSCC 2023 to talk about zettascale computing, laying out a much more conservative – some would say reasonable – timeline.

Looking at supercomputer performance trends over the past two-plus decades and the ongoing innovation in computing – think advanced package technologies, CPUs and GPUs, chiplet architectures, the pace of AI adoption, among others – Su calculated that the industry could reach zettascale within the next 10 years or so.

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 1, Funny) by Anonymous Coward on Wednesday July 11 2018, @07:19PM (5 children)

    by Anonymous Coward on Wednesday July 11 2018, @07:19PM (#705891)

    You must unlearn what you have learned.

    • (Score: 2) by takyon on Wednesday July 11 2018, @07:46PM (4 children)

      by takyon (881) <{takyon} {at} {soylentnews.org}> on Wednesday July 11 2018, @07:46PM (#705905) Journal

      Will our post-yottascale artilect gods have a sense of humor when they boss us around?

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 0) by Anonymous Coward on Wednesday July 11 2018, @08:10PM

        by Anonymous Coward on Wednesday July 11 2018, @08:10PM (#705907)

        Not if their hair is pointy.

      • (Score: 2) by maxwell demon on Wednesday July 11 2018, @08:14PM (1 child)

        by maxwell demon (1608) on Wednesday July 11 2018, @08:14PM (#705909) Journal

        Will our post-yottascale artilect gods have a sense of humor when they boss us around?

        From the summary:

        cooled to four Kelvins

        So they will be rather chilly gods.

        --
        The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 2) by inertnet on Wednesday July 11 2018, @09:10PM

        by inertnet (4071) on Wednesday July 11 2018, @09:10PM (#705934) Journal

        A group of people is creating an imaginary future that has such artificial gods: Orion's Arm [orionsarm.com]

  • (Score: 3, Interesting) by bitstream on Wednesday July 11 2018, @08:38PM (5 children)

    by bitstream (6144) on Wednesday July 11 2018, @08:38PM (#705918) Journal

    So how do you instruct this "non-von Neumann architecture" to do what you want?

    How does one interact? VT100 ?

    • (Score: 2) by Arik on Wednesday July 11 2018, @08:46PM (3 children)

      by Arik (4543) on Wednesday July 11 2018, @08:46PM (#705923) Journal
      I expect it has ethernet or better, you can prepare batch jobs on any workstation you like, and then send them to the supercomputer when you're ready to run them. Just a guess though.
      --
      If laughter is the best medicine, who are the best doctors?
      • (Score: 2) by bitstream on Wednesday July 11 2018, @09:10PM (2 children)

        by bitstream (6144) on Wednesday July 11 2018, @09:10PM (#705935) Journal

But how are those batch jobs built up? Like imperative languages, where you write what to do? Functional languages, where you write what to solve? Or a hardware description language (HDL), where you write how it shall pipe it all together?

        • (Score: 2) by Arik on Wednesday July 11 2018, @09:50PM (1 child)

          by Arik (4543) on Wednesday July 11 2018, @09:50PM (#705956) Journal
          As far as I know they typically support most if not all semi-modern languages. Fortran is likely to be most cost effective for most jobs though. And with a one-off you might expect some peculiarity. They'll have to have special libraries for whatever language you use if it needs to take full advantage of multithreading etc.
          --
          If laughter is the best medicine, who are the best doctors?
          • (Score: 3, Funny) by DECbot on Thursday July 12 2018, @12:19AM

            by DECbot (832) on Thursday July 12 2018, @12:19AM (#706004) Journal

            Java. Because just for once I want to see a native java application run just as fast as a compiled C application on my laptop.
             
            And Java because of vulnerabilities, because when the JVM does get exploited, I want the script kiddie to be overwhelmed by the system he just poned.

            --
            cats~$ sudo chown -R us /home/base
    • (Score: 2) by jasassin on Thursday July 12 2018, @09:02AM

      by jasassin (3566) <jasassin@gmail.com> on Thursday July 12 2018, @09:02AM (#706144) Homepage Journal

      So how do you instruct this "non-von Neumann architecture" to do what you want?
      How does one interact? VT100 ?

      Hah. We're not talking peta-flops here, we're talking yotta-flops. That's going to require VT220. Anyone could figure that out!

      --
      jasassin@gmail.com GPG Key ID: 0xE6462C68A9A3DB5A
  • (Score: 4, Interesting) by Arik on Wednesday July 11 2018, @08:43PM (10 children)

    by Arik (4543) on Wednesday July 11 2018, @08:43PM (#705921) Journal
    I'm no expert, feel free to chime in and tell me if I'm wrong, particularly if you are.

    But my impression of what's gone on for the last couple of decades is that, back with RISC if not before, we got to this place where we had good fast hardware - but to use it properly required programmers learning to think differently. That proved incredibly unpopular, so what the "market leaders" provided instead amounts to something roughly like the old Transmeta project that Linus T. worked on - that specific project went away but all the big players licensed their 'IP' and if I'm not badly mistaken all the currently produced desktop CPUs are built on it. And THAT was a RISC core hidden behind an on-the-fly translation layer that attempted to re-write standard x86 (machine) language into RISC on the fly, before handing them to the actual core.

There are a number of reasons why this just can't work all that well, yet despite them, Transmeta (with some help from Mr Torvalds) managed to hack that concept into a processor that was actually sort of competitive in the low-power segment. And those hacks are what Intel and the rest bought, and so today these cores, presumably somewhat RISCish though a black box with ultimately unknown contents, are chugging away doing the work speaking a language that no programmer speaks. Every line of code, EVEN if you code in binary, is being machine translated into an entirely different language before it's executed. The two languages are so different that multiple translations are prepared and executed before the processor can figure out which one to use.

    It seems an incredibly indirect process, the programmers have grown so remote from the hardware they can't possibly know what their code really does. Why not just learn the new language? Is RISC machine language really that much harder than x86? If so, why? If not, what motivated the folks that made the decisions here to go down this road rather than focusing on improving compilers and development tools for less byzantine architectures?
    --
    If laughter is the best medicine, who are the best doctors?
    • (Score: 3, Interesting) by bitstream on Wednesday July 11 2018, @09:23PM (6 children)

      by bitstream (6144) on Wednesday July 11 2018, @09:23PM (#705940) Journal

      Compatibility (tm), No one ever got fired for buying IBM. Everybody uses Microshoaft etc. And so computers that were IBM PC x86 compatible got sold. Because people in big blue suits with tie, long tie. Made the buying decisions. And buyers being lazy. But all this compatibility with a freak architecture from 1972 designed to power the Datapoint 2200 terminal came with a complexity and speed cost.

So when embedded and mobile phones started, they had no baggage and wanted the most computing per joule. Thus ARM. But there's been others like MIPS and SPARC. But they are not big because said suits (tm). The DEC Alpha processors were innovative but Intel assimilated them, i.e. resistance was futile, then.

      If you want to screw the x86 hegemony, you will have to subvert the framework supporting it.

      • (Score: 2) by Arik on Wednesday July 11 2018, @10:49PM (5 children)

        by Arik (4543) on Wednesday July 11 2018, @10:49PM (#705968) Journal
        While I have no doubt there's truth there and that's part of the story, it still seems very incomplete.

        I'd love to hear from someone that has experience programming a RISC or RISC-ish chip in machine language, how that compares to x86 (or z80 which is nearly the same thing.)
        --
        If laughter is the best medicine, who are the best doctors?
        • (Score: 1, Interesting) by Anonymous Coward on Wednesday July 11 2018, @11:01PM (4 children)

          by Anonymous Coward on Wednesday July 11 2018, @11:01PM (#705973)

          Short version:

          If you really dive into RISC asm, the sensible way is to use a macro assembler that lets you pretty much write your own domain-specific language to implement whatever you want.

          In a way, it's like specifying the microcode that you need.

          On the other hand, non-RISC sort of hands you the language you'll use, but it's always the same, because the chip incorporates it.

          You get inexperienced programmers that try to do everything in RISC assembler as if they were doing non-RISC assembler, and you can always find their offices because of the whimpers of pain from self-inflicted wounds.

          • (Score: 2) by Arik on Thursday July 12 2018, @12:28AM (3 children)

            by Arik (4543) on Thursday July 12 2018, @12:28AM (#706012) Journal
            Ok, AC, you *sound* like you might be talking from experience.

            Here's my problem following you.

            "If you really dive into RISC asm, the sensible way is to use a macro assembler that lets you pretty much write your own domain-specific language to implement whatever you want."

            But that's exactly what I remember doing on x86! So I'm afraid I still can't quite see the difference you're getting at. You might sometimes have to macro two commands on RISC where one would suffice on x86, that I get, but you're going to be writing macros (or using compilers) for the most part with either one so how is that such a big deal?

            --
            If laughter is the best medicine, who are the best doctors?
            • (Score: 0) by Anonymous Coward on Thursday July 12 2018, @05:09AM (2 children)

              by Anonymous Coward on Thursday July 12 2018, @05:09AM (#706108)

              Yes, I am.

              I do realise what you mean, but it's another meta-level down.

              If you're using a compiled language, it's basically not a difference from the programmer's perspective. Write your code, crank the compiler, debug. Same as always. The relevant differences are hidden from you.

              But if you're bashing the metal, the difference can be expressed if you remember the original battle-cry of the RISC partisans; that microcode isn't magic. Remember also that they were drawing a comparison between chips like the MIPS, and the old IBM mainframe setup, where the instruction set was a complex beast full of microcode written with the express purpose of making life easier for applications programmers, doing things like automatically running down an array of values in RAM and incrementing them, or other fancy tasks like that. The RISC partisans were pointing out that the ALU underlying it was programmable to do that without the microcode, and their contention was that compilers were great enough that you'd never have to worry about the microcode-equivalent layer anyway.

              In a RISC chip the idea was that each cycle was one instruction, and always explicit, whereas in old-school chips, one instruction may in fact take a number of cycles while microcode ran.

              So what you're doing if you are bashing the metal in a RISC chip is kind of custom-coding at the microcode level, writing (in macro form) the microcode that you want.

              Does that make it clearer?

              • (Score: 1) by Arik on Thursday July 12 2018, @05:16AM

                by Arik (4543) on Thursday July 12 2018, @05:16AM (#706112) Journal
                "Does that make it clearer?"

                A bit perhaps, I think I will need to digest that and this article I have in another tab on microcode before it really makes sense to me, but I have something to chew on.

                Thanks.
                --
                If laughter is the best medicine, who are the best doctors?
              • (Score: 3, Interesting) by stormwyrm on Thursday July 12 2018, @06:41AM

                by stormwyrm (717) on Thursday July 12 2018, @06:41AM (#706124) Journal

                This reminds me of a very old DDJ article [ec.gc.ca] about the right way to program a PC+i860 system as a supercomputing system. Basically the ideas from the article involved using the i860 RISC as though it were a processor that had programmable microcode. They had a hellishly optimised interpreter for a sort of custom instruction set geared for the various high-level languages available (C and FORTRAN were mentioned) such that the interpreter's core loop almost completely fit inside the i860's onboard cache. This turned the i860 into a sort of custom processor for a specialised instruction set. No I/O was being done by the i860 code at all, that being entirely the task of the x86 half of the system. The article is 26 years old but it still makes for interesting reading.

                --
                Numquam ponenda est pluralitas sine necessitate.
    • (Score: 1, Interesting) by Anonymous Coward on Wednesday July 11 2018, @11:18PM (2 children)

      by Anonymous Coward on Wednesday July 11 2018, @11:18PM (#705976)

      The Intel 8080 and 8086 were microcoded [clemson.edu].

      • (Score: 0) by Anonymous Coward on Thursday July 12 2018, @12:44AM

        by Anonymous Coward on Thursday July 12 2018, @12:44AM (#706018)

        Thanks what a great article

      • (Score: 2) by Arik on Friday July 13 2018, @03:15AM

        by Arik (4543) on Friday July 13 2018, @03:15AM (#706488) Journal
        I don't have a problem with microcode per se, as long as it's documented fully.

        But undocumented microcode running in a black box is a pig in a poke.
        --
        If laughter is the best medicine, who are the best doctors?