
To Reinvent the Processor

posted by Fnord666 on Tuesday May 14 2019, @07:43PM
from the gate-twiddling dept.

I'm tired of the dominance of the out-of-order processor. They are large and wasteful, the ever-popular x86 is especially poor, and they are hard to understand. Their voodoo would be more appreciated if they pushed harder at the limits of computation, but it's obvious that the problems people solve have a latent, inaccessible parallelism far in excess of what an out-of-order core can extract. The future of computing should surely not rest on terawatts of power burnt to pretend a processor is simpler than it is.

There is some hope in the ideas of upstarts, like Mill Computing and Tachyum, as well as research ideas like CG-OoO. I don't know if they will ever find success; I wouldn't bet on it. Heck, the Mill might never even get far enough to have the opportunity to fail. Yet I find them exciting, and much of the offhand "sounds like Itanium" naysaying is uninteresting.

This article focuses on architectures in proportion to how much creative, interesting work they've shown in public. This means much of this article comments on the Mill architecture, there is a healthy amount on CG-OoO, and the Tachyum is mentioned only in passing.

https://medium.com/@veedrac/to-reinvent-the-processor-671139a4a034

A commentary on some of the more unusual processor architectures in the works, with a focus on Mill Computing's belt machines.


Original Submission

 
  • (Score: 2) by RamiK on Tuesday May 14 2019, @10:48PM (2 children)

    by RamiK (1813) on Tuesday May 14 2019, @10:48PM (#843631)

    A practical proof of such suggestions is easy

    WIP [millcomputing.com] by the looks of things.

    increasing ability to optimize suboptimal code...requires really special compilers

    They've ported LLVM (refactoring a lot of it to be able to target their spec-driven ISA) and are busy porting the C++ standard library and a few other libraries, while also writing a testing micro-kernel that can take advantage of the architecture's fat pointers while running real-world open source software. The post linked above also mentions they're playing with MicroPython while still fixing things in the simulator, so they're working out other compilers and dynamic languages too.

    32-bit x86 might be archaic, but it is still about the most dense representation of program flow around

    Thumb is considered as dense as i386, and I believe one of RISC-V's variants (the compressed instruction extension) matched that in real-world comparisons.

    If the startups can compile C, they also can recompile x86 (possibly even in software), like Transmeta tried.

    Very likely. After all, Intel already decodes those x86 instructions on the fly into its own internal micro-ops.
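
    For illustration, a minimal, hypothetical sketch of the Transmeta-style idea in C++: decode each guest instruction once, cache the translation, and run the cached native code from then on. The two-instruction "guest" ISA and every name in it are made up; a real x86 translator would be enormously more involved.

```cpp
// Toy sketch of Transmeta-style dynamic binary translation: decode a
// made-up two-op "guest" ISA once, cache the translation, and run the
// cached native code thereafter. Purely illustrative.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <unordered_map>
#include <vector>

enum class GuestOp : uint8_t { AddImm, Halt };   // hypothetical guest ISA
struct GuestInsn { GuestOp op; int32_t imm; };

using NativeBlock = std::function<size_t(int32_t&)>; // returns next guest PC

int main() {
    std::vector<GuestInsn> guest = {{GuestOp::AddImm, 2},
                                    {GuestOp::AddImm, 40},
                                    {GuestOp::Halt, 0}};
    std::unordered_map<size_t, NativeBlock> cache;   // translation cache
    int32_t acc = 0;
    size_t pc = 0;
    while (pc < guest.size()) {
        auto it = cache.find(pc);
        if (it == cache.end()) {                     // translate on first touch
            GuestInsn g = guest[pc];
            NativeBlock nb;
            if (g.op == GuestOp::AddImm)
                nb = [g, pc](int32_t& a) { a += g.imm; return pc + 1; };
            else
                nb = [&guest](int32_t&) { return guest.size(); }; // halt
            it = cache.emplace(pc, std::move(nb)).first;
        }
        pc = it->second(acc);                        // run cached translation
    }
    std::cout << "acc = " << acc << "\n";            // prints acc = 42
}
```

    The point of the cache is that the decode cost is paid once per instruction, not once per execution, which is what made software recompilation plausible at all.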

    In nerdy smalltalk I generally claim that 100 MHz are technically enough for everything, and beyond that all workloads can get parallelized.

    I highly doubt this. There were many attempts at parallelizing descriptive document formats and replacing PostScript even before HTML/JS' XML poo came along, and nothing really worked.

    --
    compiling...
  • (Score: 2) by Rich on Wednesday May 15 2019, @12:28AM (1 child)

    by Rich (945) on Wednesday May 15 2019, @12:28AM (#843650) Journal

    They seem to count cycles against an x86 simulator, but I guess that's a fair way off from the opaque way a modern desktop CPU really works internally.

    I wonder what workload you imagine that couldn't be handled at 100 MHz with one 32-bit instruction per cycle and the option to parallelize out?!

    PostScript rendering itself could be nicely broken up. One CPU per path segment could fill the edge tables, and one CPU per scanline could fill the spans with gradients. And if you go for compositing bitmaps, the shapes can be parallelized, too. It's a matter of memory bandwidth, and synchronizing shared access is itself a hard task (and might be so hard that new architectures lose their decisive battle here), but there's no inherent single-thread bottleneck anywhere in there. (See the sketch below.)
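
    A minimal sketch of that scanline split, assuming a hard-coded hypothetical triangle and a trivial even-odd fill rule: each thread owns a disjoint, interleaved set of scanlines, so the framebuffer needs no locking at all.

```cpp
// Per-scanline parallelism: an edge table is built once (read-only), then
// each thread fills its own interleaved set of scanlines, so framebuffer
// writes never conflict and no synchronization is needed.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

struct Edge { float x0, y0, x1, y1; };

// x coordinate of an edge at scanline y (assumes the edge spans that y)
static float xAt(const Edge& e, float y) {
    return e.x0 + (e.x1 - e.x0) * (y - e.y0) / (e.y1 - e.y0);
}

int main() {
    const int W = 64, H = 64;
    std::vector<uint8_t> fb(W * H, 0);
    // a fixed triangle as an edge list (hypothetical shape)
    std::vector<Edge> edges = {{10, 5, 50, 60}, {10, 5, 60, 30}, {60, 30, 50, 60}};

    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        pool.emplace_back([&, t] {
            for (int y = t; y < H; y += n) {            // this thread's rows
                std::vector<float> xs;
                for (const Edge& e : edges) {           // gather crossings
                    float lo = std::min(e.y0, e.y1), hi = std::max(e.y0, e.y1);
                    if (y + 0.5f >= lo && y + 0.5f < hi)
                        xs.push_back(xAt(e, y + 0.5f));
                }
                std::sort(xs.begin(), xs.end());
                for (size_t i = 0; i + 1 < xs.size(); i += 2)  // fill spans
                    for (int x = (int)xs[i]; x < (int)xs[i + 1]; ++x)
                        fb[y * W + x] = 255;
            }
        });
    }
    for (auto& th : pool) th.join();
    std::cout << "filled pixels: "
              << std::count(fb.begin(), fb.end(), 255) << "\n";
}
```

    Each thread re-scans the shared, read-only edge list; a real rasterizer would bucket edges per scanline first, but the no-locks property is the point here.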

    Going by gut feeling, I'd estimate the "lowest overall effort" sweet spot for a performance-tuned software foundation somewhere around a 7-stage in-order CPU with 4 execution units (complex ALU, simple ALU, load/store, branch) shared between 2 SMT register files (to mitigate stalls wasting area). And then two of those. It could be that the cache required to handle the common workloads needs so much silicon area, though, that making the CPU more complex wouldn't matter, but I guess the cost/performance-sensitive gaming boxes (e.g. the Wii) are near my estimate and somewhat validate it. (Also, I'm strangely fond of the RK3399's specs, and not just because it has internal kit to do low-latency 16-channel audio...)
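
    Out of curiosity, a toy cycle model of that sweet spot, purely illustrative: two SMT instruction streams sharing the four execution units named above, issuing strictly in order with round-robin priority between the register files. The workloads and the unit mix are invented, and no pipeline stages, forwarding, or memory are modeled.

```cpp
// Toy issue model of a 2-way SMT in-order core with four shared execution
// units (complex ALU, simple ALU, load/store, branch). Each cycle a thread
// may issue its next in-order op if the matching unit is still free.
#include <array>
#include <cstdio>
#include <queue>

enum Unit { CALU, SALU, LS, BR, NUNITS };

int main() {
    // per-thread in-order instruction streams (hypothetical workloads)
    std::array<std::queue<Unit>, 2> threads;
    for (Unit u : {SALU, SALU, LS, CALU, BR}) threads[0].push(u);
    for (Unit u : {LS, SALU, LS, SALU, BR})  threads[1].push(u);

    int cycle = 0, retired = 0;
    while (!threads[0].empty() || !threads[1].empty()) {
        bool busy[NUNITS] = {false};
        // round-robin priority alternates between the two register files
        for (int i = 0; i < 2; ++i) {
            int t = (cycle + i) % 2;
            if (threads[t].empty()) continue;
            Unit need = threads[t].front();
            if (!busy[need]) {          // unit free: issue in order
                busy[need] = true;
                threads[t].pop();
                ++retired;
            }                           // else: this thread stalls a cycle
        }
        ++cycle;
    }
    std::printf("retired %d ops in %d cycles (IPC %.2f)\n",
                retired, cycle, (double)retired / cycle);
}
```

    The printed IPC shows the SMT effect: when one thread stalls on a busy unit, the other can usually still issue, so the shared units stay fuller than either thread could keep them alone.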

    • (Score: 2) by RamiK on Wednesday May 15 2019, @11:21AM

      by RamiK (1813) on Wednesday May 15 2019, @11:21AM (#843776)

      I'd estimate the "lowest overall effort" sweet spot...

      It's pretty close to the most recent MIPS and the in-order ARM cores.

      One CPU per path segment could fill the edge tables and one CPU per scanline can fill the spans with gradients. And if you go for compositing bitmaps, the shapes can be parallelized, too.

      Ghostscript has had multithreaded support since forever, and it didn't help, since the cursor writes are relative to the previous write and linear ( https://ghostscript.com/pipermail/gs-devel/2011-May/008970.html [ghostscript.com] ). TeX might have a better chance due to its box model, but I think Knuth would have done as much by now if it were possible.

      There was an unrelated essay from HP Labs back in the day about instruction-level parallelism that evaluated PostScript, spelling out all the issues and why it's a good candidate... Can't find it now, though.

      It's a matter of memory bandwidth, and synchronizing shared access itself is a hard task

      That's like saying the only reason we can't run fast enough is friction. Locks and cycles lost to IPC on the one hand, and the limited width and depth of the branch predictor on the other, represent real hardware limits that are at the core of the issue, much like how slow the cache is. Intel hit the depth ceiling with the Pentium 4 and is now hitting it again with its more recent cores. They even managed to expand to 5-wide for certain instructions using some crazy voodoo I can't follow... Regardless, the solution space for all of this after the failed Itanium is the topic under discussion. The Mill especially exposes enough of the pipeline and goes wide enough precisely to address instruction-level parallelism. They're even talking about zero-cost hardware-thread IPC, which is basically why they're writing their own kernel... But there are some drawbacks raised in the article that leave room for different designs, which in a way is actually reassuring, since it means we'll be seeing different parties coming up with solutions in the next few years.

      Also I'm strangely fond of the RK3399's specs, and not just because it has internal kit to do low-latency 16 channel audio...

      I don't get why everyone likes those Cortex-A72s so much. I liked the original RasPis well enough when they came out, for the simple stuff they did well. But at these frequencies I'm WAY too lazy to optimize around a 3-wide core to save $5 of BoM for no real performance/power gains. I'll pick up a Cortex-A76 SBC/laptop when they get cheap enough. And yeah, it will have a nice DSP I'll abuse for guitar effects, most likely :D

      --
      compiling...