

posted by martyb on Tuesday August 23 2016, @12:41AM   Printer-friendly
from the we've-come-a-long-way-since-the-8087 dept.

ARM licensees will be able to operate on 16 times more information per vector per instruction than before, using ARM's new Scalable Vector Extensions:

Today ARM is announcing an update to their line of architecture license products. With the goal of moving ARM further into servers, the data center, and high-performance computing, the new license add-on tackles a fundamental data center and HPC issue: vector compute. ARMv8-A with Scalable Vector Extensions won't be part of any ARM microarchitecture license today, but for the semiconductor companies that build their own cores with the instruction set, this could see ARM move up into the HPC markets. Fujitsu is the first public licensee on board, with plans to include ARMv8-A cores with SVE in the Post-K RIKEN supercomputer in 2020.

Scalable Vector Extensions (SVE) is a flexible addition to the ISA, supporting vector lengths from 128 bits up to 2048 bits. ARM has specified the extensions so that code is vector-length agnostic: it doesn't matter whether the code in flight assumes 128-bit, 512-bit or 2048-bit vectors, the hardware will arrange the calculations to fit the vector width that is actually implemented. Thus 2048-bit code run on a 128-bit SVE core will have its operations split up in such a way as to complete the calculation, while 128-bit code on a 2048-bit core can improve IPC by bundling 128-bit calculations together. ARM's purpose here is to move the vector-width problem away from software and into hardware.
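The vector-length-agnostic idea above can be sketched in plain C. This is only an illustration, not real SVE code (actual SVE uses predicated hardware loops or ACLE intrinsics); the `hw_lanes()` helper is a made-up stand-in for whatever width the hardware reports, so the same source works whether a part has 128-bit or 2048-bit vectors:

```c
/* Sketch of vector-length-agnostic strip-mining, the model behind SVE.
 * The source never hard-codes a vector width: it loops over whatever
 * width the hardware reports, with a predicated tail for leftovers.
 * hw_lanes() is a hypothetical helper for this sketch only. */
#include <stddef.h>

/* Pretend hardware query: number of 32-bit lanes per vector
 * (4 on a 128-bit implementation, 64 on a 2048-bit one). */
static size_t hw_lanes(void)
{
    return 4;
}

/* Add two float arrays of arbitrary length n, one "vector" at a time. */
void vla_add(float *dst, const float *a, const float *b, size_t n)
{
    size_t vl = hw_lanes();
    for (size_t i = 0; i < n; i += vl) {
        /* Predication: the final iteration covers fewer than vl lanes,
         * so no scalar fix-up loop is needed after the main loop. */
        size_t lanes = (n - i < vl) ? n - i : vl;
        for (size_t j = 0; j < lanes; ++j)
            dst[i + j] = a[i + j] + b[i + j];
    }
}
```

Note that `n` need not be a multiple of the vector width; the tail is handled by shortening the last iteration, which is roughly what SVE's predicate registers do in hardware.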


Original Submission

This discussion has been archived. No new comments can be posted.
  • (Score: 0) by Anonymous Coward on Tuesday August 23 2016, @06:48AM

    by Anonymous Coward on Tuesday August 23 2016, @06:48AM (#392016)

    Yeah!! 2048, take that, Intel! For sure!

    Riiiiight, they say it's scalable, but I won't believe it until I can test it myself. Intel made a big stink about extending AVX to 256 bits and 512 bits, but it's mostly gimmicky bullshit. AVX-128 is still the fastest. Which isn't to say it's completely gimmicky bullshit, because there are still more registers available, and more registers are better, but using them to their full width doesn't provide better performance.

    Don't trust. Verify.

    • (Score: 3, Informative) by TheRaven on Tuesday August 23 2016, @08:10AM

      by TheRaven (270) on Tuesday August 23 2016, @08:10AM (#392030) Journal

      You might like to look at the research behind it. Krste Asanović's group at Berkeley has been pushing this kind of design for a long time (amusingly, RISC-V began life as a toy core that was just complex enough to do the general-purpose compute needed to push data to and from the experimental vector units). Abstracting the ISA from the microarchitecture is generally considered a good thing, and a number of researchers have been arguing that fixed-width vectors are a bad design. Oh, and Cray was building supercomputers with a similar design of vector unit decades ago.

      Intel has actually done something similar in the past too. Some of the Atom cores only supported 64-bit vectors and so 128-bit SSE operations were dispatched over two cycles (most of them are multi-cycle operations, so this didn't give too much of a slowdown unless your code was already packing the SSE instructions tightly enough to saturate the vector units on a Xeon).

      This kind of decoupling is very important for ARM, because one of their key selling points is the ability to have a single ecosystem (for things like compilers and operating systems) for everything from high-end embedded uses[1] all of the way up to server and supercomputer chips. This kind of design will allow supercomputer chips to include 2048-bit vector lanes, small low-power chips to use 64-bit data paths, and the same code to run on both (though, obviously, at very different speeds).

      [1] The low-end embedded chips are all M-profile, which shares a lot of the compiler infrastructure but doesn't have an MMU and typically doesn't run a conventional OS.

      --
      sudo mod me up
    • (Score: 2) by NCommander on Tuesday August 23 2016, @11:32AM

      by NCommander (2) Subscriber Badge <michael@casadevall.pro> on Tuesday August 23 2016, @11:32AM (#392072) Homepage Journal

      AltiVec on PowerPC went largely unused because it's difficult to create a compiler that can auto-vectorize C code. Itanium basically suffered this problem at the architectural level, since the entire thing is basically vector processing.
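      A minimal sketch of one reason auto-vectorizing C is hard: without extra information, the compiler must assume the output array may overlap the input, which makes iterations order-dependent and rules out wide loads and stores. C99's `restrict` is one way the programmer supplies the no-overlap guarantee (function names here are illustrative, not from any of the projects discussed):

```c
/* Aliasing and auto-vectorization, sketched.  Both functions compute
 * dst[i] = src[i] + src[i+1] for i in [0, n). */
#include <stddef.h>

/* Hard to vectorize: dst could legally overlap src + 1, so a wide
 * store into dst might clobber a src element a later lane still needs.
 * The compiler has to be conservative. */
void shift_add(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + src[i + 1];
}

/* Vectorizes readily: restrict promises the compiler that dst and src
 * do not overlap, so whole vectors can be loaded and stored at once. */
void shift_add_restrict(float * restrict dst,
                        const float * restrict src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + src[i + 1];
}
```

      (Compilers can also emit a runtime overlap check and pick a vectorized or scalar path, but that adds code size and per-call overhead, which is part of why auto-vectorization so often disappoints.)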

      --
      Still always moving
      • (Score: 2) by RamiK on Tuesday August 23 2016, @02:32PM

        by RamiK (1813) on Tuesday August 23 2016, @02:32PM (#392141)

        Worth linking: http://llvm.org/docs/Vectorizers.html [llvm.org] & https://gcc.gnu.org/projects/tree-ssa/vectorization.html [gnu.org]

        For Java, Intel pushed some vectorization support into HotSpot, and there's still ongoing work. But the closed-source Oracle stack is said to have decent and well-rounded auto-vectorization.

        So, if I had to guess, these features are almost exclusively enterprise Java oriented. It's possible they even have specific applications in mind for this. Possibly in the financial sector...

        --
        compiling...
      • (Score: 2) by FatPhil on Tuesday August 23 2016, @06:19PM

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Tuesday August 23 2016, @06:19PM (#392225) Homepage
        Altivec was no harder to software-vectorise than SSE* was. SSE* succeeded in the marketplace, therefore automatic vectorisation was not the issue.

        Itanium was just VLIW, and so required heterogeneous operation parallelism rather than the homogeneous operation parallelism of vectorising. That's a slightly easier problem, as you can mix and match what goes together (any input, any output, any mathematical calculation, any address generation, etc.); vectorising requires an exact match, and to be a win requires several exact matches. Itanium failed in the marketplace because they thought the best way to beat AMD was to start playing a completely different game, but when they realised that x86 support was necessary, they added support which was utterly lousy (way worse than DEC's FX!32 x86 emulation on 2-generation-old chips).

        This new "advance" is just going back to the x86_64 throw-tons-of-gates-on-the-chip logic that VLIW was in part a reaction against.

        There are some who think that just makes hot, expensive-to-cool server farms, and that something like the VLIW attitude of doing almost all of the scheduling decisions in the compiler is the way to go. The ones with the most groundbreaking ideas in that direction are the boffins behind the "Mill" architecture, which has decided to just reinvent *everything* from scratch, and which therefore, if it ever reaches tape-out, will be the only really revolutionary computer architecture in about 4 decades. Something like 13 vids are now up on YouTube; they're all worth a watch.
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves