
posted by Fnord666 on Tuesday May 14 2019, @07:43PM
from the gate-twiddling dept.

I'm tired of the dominance of the out-of-order processor. They are large and wasteful, the ever-popular x86 is especially poor, and they are hard to understand. Their voodoo would be more appreciated if they pushed better at the limits of computation, but it's obvious that the problems people solve have a latent inaccessible parallelism far in excess of what an out-of-order core can extract. The future of computing should surely not rest on terawatts of power burnt to pretend a processor is simpler than it is.

There is some hope in the ideas of upstarts, like Mill Computing and Tachyum, as well as research ideas like CG-OoO. I don't know if they will ever find success. I wouldn't bet on it. Heck, the Mill might never even get far enough to have the opportunity to fail. Yet I find them exciting, and much of the offhand "sounds like Itanium" naysay is uninteresting.

This article focuses on architectures in proportion to how much creative, interesting work they've shown in public. This means much of this article comments on the Mill architecture, there is a healthy amount on CG-OoO, and the Tachyum is mentioned only in passing.

https://medium.com/@veedrac/to-reinvent-the-processor-671139a4a034

A commentary on some of the more unusual processor architectures in the works, with a focus on Mill Computing's belt machines.


Original Submission

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 3, Interesting) by bob_super on Tuesday May 14 2019, @08:59PM (12 children)

    by bob_super (1357) on Tuesday May 14 2019, @08:59PM (#843587)

    > We x86 haters despise the complexities and over-engineering aspect of OoO and branch predictors

    Sure. If you can design a multi-GHz processor in which Every Single Resource has the same execution time, I'll be glad to run single-threaded in-order code.
    In the meantime, the rest of us enjoy that you can dispatch another 4 unrelated operations while waiting for that double-precision divide result, let alone for RAM (or disk, if you don't multi-thread).

    And getting rid of branch prediction is only efficient if you don't have a pipeline, which gets entertaining.

    x86 sucks, but most big features of modern x64 are justified regardless of the instruction set.

  • (Score: 2) by Immerman on Wednesday May 15 2019, @12:24AM (11 children)

    by Immerman (3985) on Wednesday May 15 2019, @12:24AM (#843649)

    If I understand correctly, the argument is not against parallelization or pipelining itself, but against the CPU making those decisions on the fly rather than exposing the functionality to the compiler, which has much greater contextual information at its disposal.

    • (Score: 2) by bob_super on Wednesday May 15 2019, @12:36AM (5 children)

      by bob_super (1357) on Wednesday May 15 2019, @12:36AM (#843652)

      A lot of stuff cannot be decided before run-time. The range of possibilities is just too big.
      So you still need the run-time hardware.
      Shouldn't prevent compilers from getting better, but you can't displace the hardware.

      • (Score: 2) by Immerman on Wednesday May 15 2019, @12:44AM (4 children)

        by Immerman (3985) on Wednesday May 15 2019, @12:44AM (#843654)

        Care to give an example?

        The time it takes a CPU to complete an instruction is pretty much written in stone - at least for any given CPU. The time required to retrieve data from RAM is more variable, especially for a parallel processor, but optimizing around some worst-case-scenario assumptions with the full-program contextual information and performance profiling is still likely to be at least competitive with what a CPU can do on the fly.

        • (Score: 0) by Anonymous Coward on Wednesday May 15 2019, @12:55AM (1 child)

          by Anonymous Coward on Wednesday May 15 2019, @12:55AM (#843658)

          You cannot "optimize around" things designed to not be predictable.

          • (Score: 2) by Immerman on Wednesday May 15 2019, @02:38AM

            by Immerman (3985) on Wednesday May 15 2019, @02:38AM (#843677)

            Except that happens entirely invisibly to the program, and unless I'm very much mistaken, should not be relevant to optimization. Putting executables in random places in memory has negligible effect on memory access times (instruction access patterns within an executable or library are the same; only the absolute memory location changes) and none on instruction execution order.

        • (Score: 0) by Anonymous Coward on Wednesday May 15 2019, @05:00AM

          by Anonymous Coward on Wednesday May 15 2019, @05:00AM (#843698)

          Cache access can be pretty unpredictable for certain applications. Floating point operations can vary in execution time when denormals are possible. Just those off the top of my head. If scheduling were easy, processors would likely all be modeled on Itanium.
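
          A rough C sketch of the denormal effect, for what it's worth (not from the comment; the size of the gap varies a lot by CPU, and flush-to-zero / denormals-are-zero modes or -ffast-math can hide it entirely):

          #include <stdio.h>
          #include <time.h>

          /* The fixed point of acc = 0.5*acc + 0.5*x is x, so the operands stay
           * normal when x is normal and subnormal when x is subnormal. */
          static double run(double x, long iters) {
              volatile double acc = x;
              for (long i = 0; i < iters; i++)
                  acc = acc * 0.5 + x * 0.5;
              return acc;
          }

          int main(void) {
              const long N = 20000000;
              clock_t t0 = clock();
              run(1.0, N);      /* normal operands */
              clock_t t1 = clock();
              run(1e-310, N);   /* subnormal operands (doubles below ~2.2e-308) */
              clock_t t2 = clock();
              printf("normal:    %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
              printf("subnormal: %.3fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
              return 0;
          }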

        • (Score: 0) by Anonymous Coward on Friday May 17 2019, @10:49AM

          by Anonymous Coward on Friday May 17 2019, @10:49AM (#844659)

          The time it takes a CPU to complete an instruction is pretty much written in stone - at least for any given CPU.

          You can actually write that and not see the problem already? How many different Intel and AMD CPU families are out there at the moment? How many ARM families? Will those cycles/times always be the same for future generations of your wonderful non-OoO CPUs?

          In the real world not many people use stuff like Gentoo and keep recompiling everything for their systems.

          See also: https://www.agner.org/optimize/instruction_tables.pdf [agner.org]

          Some instructions might take the same number of cycles across all families, but will enough of them do so? Just compare CALL and RET cycles and you'll see many families have differences.

          CPUs that require and rely on "clever" compilers to extract performance from their hardware for "general computing" tend to have problems when their new generations have significantly different hardware. It's not a big problem for more specialized hardware support like AES or SHA acceleration, where >99% of the time you won't need that acceleration, but you do have a problem when >99% of the time you need the compiler cleverness to get the extra 20% of speed in "general computing".

    • (Score: 2) by RS3 on Wednesday May 15 2019, @12:59AM (4 children)

      by RS3 (6367) on Wednesday May 15 2019, @12:59AM (#843659)

      I agree. Some of the CPU's runtime optimizations are pretty much hardware / microcode specific; for instance, the compiler can't do branch prediction, pretty much by definition. I've felt for many years that there's a bit of a tug-of-war between compiler optimizations and CPU enhancements.

      My hunch is that a better system would involve a significantly different approach to CPU instruction sets, with more control over internal CPU logic. RISC kind of does this by default: a very simple CPU, with more of the optimization done in the compiler.

      I like the idea of a RISC main integer unit, with math, string, and vector operations (FPU, MMX, SSE, etc.) done in parallel in on-chip co-processors, keeping in mind that multiple cores are there to run other threads while the FPU calculates something, for example. And of course more of the vector work is being done on GPUs now, so CISC is becoming less relevant IMHO.

      • (Score: 2) by Immerman on Wednesday May 15 2019, @02:59AM (3 children)

        by Immerman (3985) on Wednesday May 15 2019, @02:59AM (#843681)

        Of course the compiler can do branch prediction, you just need to do performance profiling first. Heck, that was something any decent coder used to do by hand as a matter of course: any performance-relevant branch statement should be structured to usually follow the first available option so that the pipeline wasn't disrupted by an unnecessary branch. In essence, performance-critical code should always be written as:
        if (usually true) {
            do the most common code path
        } else {
            unusual code path that will inherently cause a pipeline flush,
            because you took the non-first branch to get here
        }
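
        Something like GCC/Clang's __builtin_expect lets you spell that hint out explicitly (a minimal sketch, not from the comment above; profile-guided optimization via -fprofile-generate / -fprofile-use automates gathering the probabilities):

        #include <stdio.h>

        /* Tell the compiler the error path is rare, so the common path is
         * laid out as straight-line fall-through code. */
        #define likely(x)   __builtin_expect(!!(x), 1)
        #define unlikely(x) __builtin_expect(!!(x), 0)

        int process(int value) {
            if (likely(value >= 0)) {
                return value * 2;                       /* hot path */
            } else {
                fprintf(stderr, "unexpected negative value\n");
                return -1;                              /* cold path */
            }
        }

        int main(void) {
            printf("%d\n", process(21));
            return 0;
        }

        (C++20 standardizes the same hint as the [[likely]] / [[unlikely]] attributes.)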

        Now granted, that can't adapt on the fly to changing probability distributions, but it's fairly rare code where the probability distributions change significantly on the fly.

        As for RISC, as I recall OoO execution and branch prediction thrived there as well; as the name implies, RISC was more about the instruction set than the microcode that executes it.

        • (Score: 2) by Immerman on Wednesday May 15 2019, @03:01AM

          by Immerman (3985) on Wednesday May 15 2019, @03:01AM (#843682)

          That should have been
          >...so that the pipeline wasn't disrupted by an unnecessary conditional jump

        • (Score: 2) by maxwell demon on Wednesday May 15 2019, @07:31AM (1 child)

          by maxwell demon (1608) on Wednesday May 15 2019, @07:31AM (#843722) Journal

          Of course the compiler can do branch prediction, you just need to do performance profiling first.

          That's assuming you can accurately predict the data patterns that will go into the program. What if a program is used with two very different patterns? For example, I could imagine that some loops in video encoders show very different behaviour depending on whether they're encoding live-action recordings or 2D cartoons.

          --
          The Tao of math: The numbers you can count are not the real numbers.
          • (Score: 2) by RS3 on Thursday May 16 2019, @07:32PM

            by RS3 (6367) on Thursday May 16 2019, @07:32PM (#844423)

            Yeah, source code and compiler optimizations are great, but CPU branch prediction is a different thing. CPU branch prediction is a special form of parallel processing where alternate circuits in the CPU pre-fetch code and data that might be needed in the branch while the main CPU circuitry is doing something else. "Super-scalar" CPUs have been pre-fetching data and code for a long time because CPU I/O (external access) is not as fast as internal CPU speeds, so while the CPU is crunching one thing and the I/O bus is available, pre-fetch circuits grab what they can (read-ahead cache load).

            I still have not gotten a clear answer, but I speculate the problem is that the CPU pulls in code and data for process A, then the OS context-switches control to process B, but oops, process A's code and data are still in the internal CPU cache, and oops, B owns the CPU and can read A's code and data. The kernel fixes seem to be to flush the cache and pre-fetch queues frequently, and certainly on context switches, which helps but doesn't cover all cases. It seems like the CPU should do that on its own, but I have to think about whether the CPU knows its context; probably doesn't matter. If the CPU knows the GDT entry from which the barrel load came, and the new context is different / protected, then flush the cache. Gotta think... later...