
posted by Fnord666 on Tuesday May 14 2019, @07:43PM
from the gate-twiddling dept.

I'm tired of the dominance of the out-of-order processor. They are large and wasteful, the ever-popular x86 is especially poor, and they are hard to understand. Their voodoo would be more appreciated if they pushed better at the limits of computation, but it's obvious that the problems people solve have a latent inaccessible parallelism far in excess of what an out-of-order core can extract. The future of computing should surely not rest on terawatts of power burnt to pretend a processor is simpler than it is.

There is some hope in the ideas of upstarts, like Mill Computing and Tachyum, as well as research ideas like CG-OoO. I don't know if they will ever find success. I wouldn't bet on it. Heck, the Mill might never even get far enough to have the opportunity to fail. Yet I find them exciting, and much of the offhand "sounds like Itanium" naysay is uninteresting.

This article focuses on architectures in proportion to how much creative, interesting work they've shown in public. This means much of this article comments on the Mill architecture, there is a healthy amount on CG-OoO, and the Tachyum is mentioned only in passing.

https://medium.com/@veedrac/to-reinvent-the-processor-671139a4a034

A commentary on some of the more unusual processor architectures in the works, with a focus on Mill Computing's belt machines.


Original Submission

  • (Score: 2, Insightful) by Anonymous Coward on Tuesday May 14 2019, @07:56PM (25 children)

    by Anonymous Coward on Tuesday May 14 2019, @07:56PM (#843558)

    I am very smart. I am signaling that I am very smart.

    • (Score: 2) by sshelton76 on Tuesday May 14 2019, @08:04PM (5 children)

      by sshelton76 (7978) on Tuesday May 14 2019, @08:04PM (#843565)

      Signalling, yes. But like the old lady I got stuck behind on my way to work, the signal doesn't appear to be relevant to the person who activated it.

      These chips are what they are due to literal decades of incremental improvement. There are plenty of chip designs if you don't want Wintel: you can go with ARM or MIPS, and SPARC is now open source down to the transistor level. Furthermore, any modern FPGA is going to ship with at least one IP core in Verilog that you can modify to your heart's content.

      If you want what you say you want, and you are smart, you will contribute to these projects. If you can't contribute intellectually, then at least contribute financially to projects seeking to break the x86 dominance.

      • (Score: 2) by Farkus888 on Tuesday May 14 2019, @08:25PM (4 children)

        by Farkus888 (5159) on Tuesday May 14 2019, @08:25PM (#843575)

        Yes, but in my experience this article appears to be mostly for old folks. It used to be that computer folks learned from the electrical components up. By the time I was in school it was just card swapping. Basic electronics, like what a transistor does, what a capacitor does, and whether you can identify them on that board, was completely skipped. We learned from software down and never went deeper than the interfaces between the cards. As that knowledge gets lost, this battle gets harder. Fewer people will understand the difference well enough to form an opinion, and for those who don't, inertia is the most rational side to pick.

        • (Score: 2) by RamiK on Tuesday May 14 2019, @08:27PM

          by RamiK (1813) on Tuesday May 14 2019, @08:27PM (#843578)

          Fortunately Intel never fails to provide incentives to learn the subject properly: https://www.phoronix.com/scan.php?page=news_item&px=Microarch-Data-Sampling [phoronix.com]

          --
          compiling...
        • (Score: 0) by Anonymous Coward on Wednesday May 15 2019, @04:10PM

          by Anonymous Coward on Wednesday May 15 2019, @04:10PM (#843868)

          I had to learn electronics on my own, but processor design was part of my CS program. It was very basic, but one day I picked up a book from the library, sat down, and after 4 hours of skip-reading I was caught up with the state of the art as of at least 1999.
          I promise it's not voodoo; check it out.
          https://www.ebay.com/p/Digital-Computer-Electronics-by-Jerald-A-Brown-and-Albert-P-Malvino-1992-Hardcover-Revised/506119 [ebay.com]

          This book is out of print, but if you pick it up and install Logisim you'll have a working processor by the end.
          The only reason I recommend it is that it's fantastically well written and easy to understand. It's not terribly large either.
          You don't even need to know a lot of electronics; you can pick that up later.

        • (Score: 0) by Anonymous Coward on Wednesday May 15 2019, @04:13PM

          by Anonymous Coward on Wednesday May 15 2019, @04:13PM (#843870)

          That was viable back then in a way that it isn't now. Hardware is just that much more complicated. There was a time when you could build your own computer completely from the ground up; it would be more expensive and physically larger, but not necessarily that much slower. These days there's no way you could do that without access to extremely expensive equipment.

          The problem, IMHO, is more about the failure of the youngins to learn from the history that produced what they're using. For instance, it took ages for the desktop computer as seen in the late '90s to develop. Yes, the analogy was there decades earlier, but tons of little adjustments made along the way have been chucked in the trash because they're not sexy enough.

          Personally, I'd settle for folks who just understand that a UI is supposed to be largely invisible to the user and that software doesn't need to use up all available resources.

        • (Score: 2) by driverless on Thursday May 16 2019, @11:22AM

          by driverless (4770) on Thursday May 16 2019, @11:22AM (#844224)

          It's not for old folks either; they remember that this guy, or one of his friends, has been peddling this vapourware for I-don't-know-how-many years without any sign of it getting past the vapour stage. Whenever there's sufficient news about some issue with any processor of any kind, they dust off their Mill propaganda and post it to anyone who'll listen.

          I think the Mill's main claim to fame is that it'll be the processor used to run Xanadu.

    • (Score: 3, Informative) by RamiK on Tuesday May 14 2019, @08:18PM (17 children)

      by RamiK (1813) on Tuesday May 14 2019, @08:18PM (#843572)

      Quite the opposite. We x86 haters despise the complexity and over-engineering of OoO and branch predictors, and the whole spirit behind Mill Computing is to restore sanity: expose the pipelines, go in-order, and do away with branch prediction as much as possible so that the compiler takes over the scheduling.

      --
      compiling...
      • (Score: 0) by Anonymous Coward on Tuesday May 14 2019, @08:37PM (2 children)

        by Anonymous Coward on Tuesday May 14 2019, @08:37PM (#843581)

        You could have put down some basic design choices, you know, HOW it works differently from the mega-long-pipelines-with-speculative-execution in the summary. All you put in the summary, and in your comment above, is acronyms and PR slogans.

        • (Score: 2) by RamiK on Tuesday May 14 2019, @09:13PM (1 child)

          by RamiK (1813) on Tuesday May 14 2019, @09:13PM (#843589)

          some basic design choices... All you put in the summary, and in your comment above, is acronyms and PR slogans.

          It's like trying to explain a register machine to someone who only knows how stack machines work. Would it help to say it's like a VLIW with fat pointers and something similar to the Itanium's register rotation done everywhere? Personally that explanation did nothing for me, so I ended up watching the videos and reading the papers and patent applications. And I still don't understand a lot of it, for the simple fact that they still haven't fully explained much of it while they're still filing patents, or how they're going to resolve the optimization issues that held back the Itanium.

          Anyhow, the article spends over 1000 words covering and comparing the basic concepts between the Belt and OoO before even getting into the main subject so unless you want me to just paste all of that...
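
          For a rough intuition, and only as a toy (this is not how the actual Mill hardware works, just an illustration of the temporal-addressing idea from the public talks), the belt can be sketched in a few lines of C: every new result is pushed onto the front of a fixed-length belt, and later operations name their inputs by how many results ago they were produced instead of by register number. Names and sizes below are made up.

          #include <stdio.h>

          #define BELT_LEN 8

          typedef struct {
              long slot[BELT_LEN];    /* slot[0] is the newest result */
          } belt_t;

          /* Push a new result onto the belt; the oldest value falls off the end. */
          static void belt_drop(belt_t *b, long value) {
              for (int i = BELT_LEN - 1; i > 0; i--)
                  b->slot[i] = b->slot[i - 1];
              b->slot[0] = value;
          }

          /* Read the value produced 'age' results ago (0 = most recent). */
          static long belt_get(const belt_t *b, int age) {
              return b->slot[age];
          }

          int main(void) {
              belt_t b = {{0}};
              belt_drop(&b, 6);                                   /* belt: 6       */
              belt_drop(&b, 7);                                   /* belt: 7, 6    */
              belt_drop(&b, belt_get(&b, 0) * belt_get(&b, 1));   /* "mul b0, b1"  */
              printf("newest belt value: %ld\n", belt_get(&b, 0));
              return 0;
          }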

          --
          compiling...
          • (Score: 0) by Anonymous Coward on Tuesday May 14 2019, @10:32PM

            by Anonymous Coward on Tuesday May 14 2019, @10:32PM (#843629)

            I guess everything old is new again, now that a new generation of "investors" born after the Itanic tanked is gearing up to burn their $$$ without mommy's control at last.

      • (Score: 0) by Anonymous Coward on Tuesday May 14 2019, @08:42PM

        by Anonymous Coward on Tuesday May 14 2019, @08:42PM (#843582)

        OoO is not an x86 thing; it originated at IBM in the 1960s, and plenty of '90s RISC (non-x86) design work went into improving OoO execution.

      • (Score: 3, Interesting) by bob_super on Tuesday May 14 2019, @08:59PM (12 children)

        by bob_super (1357) on Tuesday May 14 2019, @08:59PM (#843587)

        > We x86 haters despise the complexity and over-engineering of OoO and branch predictors

        Sure. If you can design a multi-GHz processor in which Every Single Resource has the same execution time, I'll be glad to run single-threaded in-order code.
        In the meantime, the rest of us enjoy being able to dispatch another 4 unrelated operations while waiting for that double-precision divide result, let alone for RAM (or disk, if you don't multi-thread).

        And getting rid of branch prediction is only efficient if you don't have a pipeline, which gets entertaining.

        x86 sucks, but most big features of modern x64 are justified regardless of the instruction set.
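
        To make the divide example concrete, here is a minimal C sketch of doing that overlap statically, which is what an in-order or VLIW-style design asks the compiler (or programmer) to do and what an OoO core does for you at run time. Function and variable names are illustrative only.

        /* Independent work is placed between starting the long-latency divide
         * and the first use of its result, so a pipelined in-order core can
         * overlap them instead of stalling. */
        double scale_and_sum(const double *v, int n, double num, double den) {
            double q = num / den;        /* long-latency divide issues here      */
            double sum = 0.0;
            for (int i = 0; i < n; i++)  /* independent work that can overlap    */
                sum += v[i];             /* the divide on an in-order machine    */
            return q * sum;              /* first real use of the divide result  */
        }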

        • (Score: 2) by Immerman on Wednesday May 15 2019, @12:24AM (11 children)

          by Immerman (3985) on Wednesday May 15 2019, @12:24AM (#843649)

          If I understand correctly, the argument is not against parallelization or pipelining itself, but that the CPU is making the decisions on how to do that on the fly, rather than exposing the functionality to be optimized by the compiler, with the much greater contextual information at its disposal.

          • (Score: 2) by bob_super on Wednesday May 15 2019, @12:36AM (5 children)

            by bob_super (1357) on Wednesday May 15 2019, @12:36AM (#843652)

            A lot of stuff cannot be decided before run-time. The range of possibilities is just too big.
            So you still need the run-time hardware.
            Shouldn't prevent compilers from getting better, but you can't displace the hardware.

            • (Score: 2) by Immerman on Wednesday May 15 2019, @12:44AM (4 children)

              by Immerman (3985) on Wednesday May 15 2019, @12:44AM (#843654)

              Care to give an example?

              The time it takes a CPU to complete an instruction is pretty much written in stone - at least for any given CPU. The time required to retrieve data from RAM is more variable, especially for a parallel processor, but optimizing around some worst-case-scenario assumptions with the full-program contextual information and performance profiling is still likely to be at least competitive with what a CPU can do on the fly.
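
              As a minimal sketch of what that compile-time approach looks like for memory latency (using GCC/Clang's __builtin_prefetch; the prefetch distance of 8 is a placeholder that profiling would normally supply for a given memory subsystem):

              /* Fetch a few iterations ahead so the load latency overlaps the
               * summation work, decided at compile time rather than by OoO
               * hardware. The distance of 8 is an illustrative guess. */
              double sum_with_prefetch(const double *a, long n) {
                  double s = 0.0;
                  for (long i = 0; i < n; i++) {
                      if (i + 8 < n)
                          __builtin_prefetch(&a[i + 8], 0, 1);  /* read, low temporal locality */
                      s += a[i];
                  }
                  return s;
              }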

              • (Score: 0) by Anonymous Coward on Wednesday May 15 2019, @12:55AM (1 child)

                by Anonymous Coward on Wednesday May 15 2019, @12:55AM (#843658)

                You cannot "optimize around" things designed to not be predictable.

                • (Score: 2) by Immerman on Wednesday May 15 2019, @02:38AM

                  by Immerman (3985) on Wednesday May 15 2019, @02:38AM (#843677)

                  Except that happens entirely invisibly to the program and, unless I'm very much mistaken, should not be relevant to optimization. Putting executables in random places in memory has a negligible effect on memory access times (instruction access patterns within an executable or library stay the same, only the absolute memory location changes) and none on instruction execution order.

              • (Score: 0) by Anonymous Coward on Wednesday May 15 2019, @05:00AM

                by Anonymous Coward on Wednesday May 15 2019, @05:00AM (#843698)

                Cache access can be pretty unpredictable for certain applications. Floating point operations can vary in execution time when denormals are possible. Those are just off the top of my head. If scheduling were easy, processors would likely all be modeled on the Itanium.

              • (Score: 0) by Anonymous Coward on Friday May 17 2019, @10:49AM

                by Anonymous Coward on Friday May 17 2019, @10:49AM (#844659)

                The time it takes a CPU to complete an instruction is pretty much written in stone - at least for any given CPU.

                You can actually write that and not see the problem already? How many different Intel and AMD CPU families are out there at the moment? How many ARM families? Will those cycle counts/timings always be the same for future generations of your wonderful no-OoO CPUs?

                In the real world not many people use stuff like Gentoo and keep recompiling everything for their systems.

                See also: https://www.agner.org/optimize/instruction_tables.pdf [agner.org]

                Some instructions might take the same number of cycles across all families, but will enough of them do so? Just compare CALL and RET cycles and you'll see many families differ.

                CPUs that require and rely on "clever" compilers to extract performance out of their hardware for "general computing" tend to have problems when their new generations have significantly different hardware. It's not a big problem for hardware support for more specialized things like AES or SHA acceleration, where >99% of the time you won't need that acceleration, but it is a problem when >99% of the time you need the compiler's cleverness to get the extra 20% of speed in "general computing".

          • (Score: 2) by RS3 on Wednesday May 15 2019, @12:59AM (4 children)

            by RS3 (6367) on Wednesday May 15 2019, @12:59AM (#843659)

            I agree. Some of the CPU's run-time optimizations are pretty much hardware / microcode specific; for instance, the compiler can't do branch prediction, pretty much by definition. I've felt for many years that there's a bit of a tug-of-war between compiler optimizations and CPU enhancements.

            My hunch is that a better system would involve a significantly different approach to CPU instruction sets, with more control over internal CPU logic. RISC kind of does this by default: very simple CPU, more optimization done in the compiler.

            I like the idea of a RISC main integer unit, with math, string, and vector operations (FPU, MMX, SSE, etc.) done in parallel in on-chip co-processors, keeping in mind that multiple cores are there to run other threads while the FPU calculates something, for example. And of course more of the vector operations are being done in GPUs now, so CISC is becoming less relevant IMHO.

            • (Score: 2) by Immerman on Wednesday May 15 2019, @02:59AM (3 children)

              by Immerman (3985) on Wednesday May 15 2019, @02:59AM (#843681)

              Of course the compiler can do branch prediction, you just need to do performance profiling first. Heck, that was something any decent coder used to do by hand as a matter of course: any performance-relevant branch statement should be structured to usually follow the first available option so that the pipeline wasn't disrupted by an unnecessary branch. In essence, performance-critical code should always be written as:
              if (usually true) {
                  do the most common code path
              } else {
                  unusual code path that will inherently cause a pipeline flush, because you took the non-first branch to get here
              }

              Now granted, that can't adapt on the fly to changing probability distributions, but it's fairly rare code where the probability distributions change significantly on the fly.

              As for RISC, as I recall OoO execution and branch prediction thrived there as well; as the name implied, RISC was more about the instruction set than about the microcode to execute it.
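
              For what it's worth, that hand-structuring can also be spelled out to the compiler directly. A small sketch using GCC/Clang's __builtin_expect (with profile-guided optimization the same hints come from measured branch frequencies instead of the programmer's guess); the function and its names are hypothetical:

              #include <stddef.h>

              #define likely(x)   __builtin_expect(!!(x), 1)
              #define unlikely(x) __builtin_expect(!!(x), 0)

              /* Count non-zero bytes; the error check is laid out as the cold,
               * out-of-line path and the hot loop branch falls through. */
              int count_nonzero(const unsigned char *buf, size_t len) {
                  if (unlikely(buf == NULL || len == 0))
                      return -1;                  /* rare error path */
                  int count = 0;
                  for (size_t i = 0; i < len; i++) {
                      if (likely(buf[i] != 0))    /* common case: no taken jump */
                          count++;
                  }
                  return count;
              }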

              • (Score: 2) by Immerman on Wednesday May 15 2019, @03:01AM

                by Immerman (3985) on Wednesday May 15 2019, @03:01AM (#843682)

                That should have been
                >...so that the pipeline wasn't disrupted by an unnecessary conditional jump

              • (Score: 2) by maxwell demon on Wednesday May 15 2019, @07:31AM (1 child)

                by maxwell demon (1608) on Wednesday May 15 2019, @07:31AM (#843722) Journal

                Of course the compiler can do branch prediction, you just need to do performance profiling first.

                That's assuming you can accurately predict the data patterns that will go into the program. What if a program is used with two very different patterns? For example, I could imagine that some loops in video encoders show very different behaviour depending on whether they encode live-action recordings or 2D cartoons.

                --
                The Tao of math: The numbers you can count are not the real numbers.
                • (Score: 2) by RS3 on Thursday May 16 2019, @07:32PM

                  by RS3 (6367) on Thursday May 16 2019, @07:32PM (#844423)

                  Yeah, source code and compiler optimizations are great, but CPU branch prediction is a different thing. CPU branch prediction is a special form of parallel processing where alternate circuits in the CPU pre-fetch code and data that might be needed in the branch while the main CPU circuit is doing something else. "Super-scalar" CPUs have been pre-fetching data and code for a long time, because CPU I/O (external access) is not as fast as internal CPU speeds, so while the CPU is crunching one thing, and I/O is available, the pre-fetch circuits grab what they can (read-ahead cache load).

                  I still have not gotten a clear answer, but I speculate the problem is that the CPU pulls in code and data for process A, then the OS context-switches control to process B, but oops, process A's code and data are still in the internal CPU cache, and oops, B owns the CPU and can read A's code and data. The kernel fixes seem to be to flush the cache and pre-fetch queues frequently, and certainly on context switches, which helps but doesn't cover all cases. It seems like the CPU should do that on its own, but I have to think about whether the CPU knows its context; probably doesn't matter. If the CPU knows the GDT entry the prefetched data came from, and the new context is different / protected, then flush the cache. Gotta think... later...

    • (Score: 0) by Anonymous Coward on Tuesday May 14 2019, @10:26PM

      by Anonymous Coward on Tuesday May 14 2019, @10:26PM (#843627)

      Nope, you're not very smart at all.

  • (Score: 1, Informative) by Anonymous Coward on Tuesday May 14 2019, @08:26PM (6 children)

    by Anonymous Coward on Tuesday May 14 2019, @08:26PM (#843577)

    The reason x86 has become an 'API' is the wealth of existing software.

    Also, Intel did try a revolutionary new architecture, but it commercially failed miserably because it relied heavily (exclusively) on compiler optimization. Intel ended up adding x86 support demanded by customers.

    Ultra-wide has also been tried (Transmeta), and they decided to also emulate x86(!). There were no native binutils; all of that was proprietary trade secrets.

    • (Score: 2) by HiThere on Tuesday May 14 2019, @08:58PM (2 children)

      by HiThere (866) Subscriber Badge on Tuesday May 14 2019, @08:58PM (#843586) Journal

      So basically your point is that proprietary software prefers x86, and FOSS develops slowly. Those are correct points, but they don't argue against the points in the summary. (He's not particularly optimistic about the success of the redesigned CPUs.)

      An interesting question is "What effect will the CPUs designed to optimize deep learning algorithms have?", but I don't even have an opinion about that.

      --
      Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
      • (Score: 1) by shrewdsheep on Wednesday May 15 2019, @09:31AM (1 child)

        by shrewdsheep (5215) on Wednesday May 15 2019, @09:31AM (#843747)

        "What effect will the CPUs designed to optimize deep learning algorithms have?", ... I am wondering on what? Isn't the really interesting question: "What will be the effect of deep learning algorithms used to optimize CPUs?" (on their performance)

        • (Score: 2) by HiThere on Wednesday May 15 2019, @04:49PM

          by HiThere (866) Subscriber Badge on Wednesday May 15 2019, @04:49PM (#843885) Journal

          It depends on which way you are looking. My guess is that if you measure bits processed per second you wouldn't see any change, but would you want to run an accounting program on a Turing machine?

          So. If you're using an algorithm that is inherently "embarrassingly parallel" on a single threaded machine, the time of execution will be a lot longer than if you run it on a strongly parallel machine. And deep learning can be designed in a way close to "embarrassingly parallel". There *are* choke points, but most of them can be worked around. Think of an n-way parallel message passing system with immutable messages. That can scale with the number of independent threads available. There *is* overhead, but not excessive. Each thread needs an input queue, and it needs to be able to determine the address of the threads that it wants to send messages to. (This isn't a deep learning specification, but it's a simplification I often use to think about it with.) A system designed for this kind of thing wouldn't work that well on an inherently sequential process, but it would work quite well for things like pattern recognition (and I don't mean regular expressions).

          I really doubt that this will become the dominant type of system, as language is an inherently serializing mode of thinking, including not only natural languages but almost all computer languages. Prograph was an attempt to abstract a parallel mode of thought, but it was never very popular, died with the pre-OSX Mac, and was implemented in a serial fashion for a serial computer. I understand that there are other dataflow languages around, but I've never been exposed to them. (Among Prograph's failings was that it was an inherently graphical language, so the programs weren't very dense in information on an "area of display" basis. This made it hard to see much of the program at once. People seem to keep making this mistake. Look at Scratch or E-Toys. They work for toy examples, but anything serious takes up so much screen space that you can't see enough at once to understand what's going on.)

          However, not being dominant isn't the same as being unimportant. They will run certain kinds of programs so much faster that there's essentially no comparison. It's like quantum computing in that way. And I also doubt that that will ever become a dominant form, but if they can stabilize things it may be very important for a significant subclass of problems.

          --
          Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
    • (Score: 0) by Anonymous Coward on Tuesday May 14 2019, @09:34PM

      by Anonymous Coward on Tuesday May 14 2019, @09:34PM (#843601)

      And it seems like someone had been working on either replacement microcode or native tools for it. However, if I remember correctly, they changed the microcode design between the original (Crusoe?) and the second-generation part (or locked it down?), which made all the previous work useless. And since neither chip was ever readily available outside of low-power and usually expensive notebook and tablet computers, no further interest or development on them took place.

      Which is too bad, because something like the Transmeta design, if it were generic enough, would do an excellent job in a multi-core processor, allowing almost-native code operation behind something like qemu to enable multi-platform and multi-arch binary execution with the possibility of fast shared memory spaces. While it could have its own share of security risks, it could also mitigate whole classes of security errors and allow the management processor to just be another reprogrammable core in whatever arch you need to program in.

    • (Score: 0) by Anonymous Coward on Tuesday May 14 2019, @10:08PM

      by Anonymous Coward on Tuesday May 14 2019, @10:08PM (#843621)

      > Also, Intel did try a revolutionary new architecture, but it commercially failed miserably because it relied heavily (exclusively) on compiler optimization. Intel ended up adding x86 support demanded by customers.

      There were a lot of reasons. Lack of x86 support and the expectation that compilers would soon be magical were certainly among them. But the implementations were also fabulously expensive and power-hungry. There was never an IA-64 you could put in a laptop, or even in a vaguely reasonable desktop. It was a set of bad design decisions that also primarily targeted a high-end market that no longer really existed by the time it came to market.

    • (Score: 0) by Anonymous Coward on Wednesday May 15 2019, @04:19PM

      by Anonymous Coward on Wednesday May 15 2019, @04:19PM (#843874)

      That was then, this is now. I'm not necessarily suggesting that we need to do this, but these days it's viable to put more than one processor on a single chip. Had Intel been able to do that, there would have been a far greater likelihood of success in this venture. It would have been a temporary lack of progress in speed in exchange for a significant increase down the line.

      Or there's the Apple model, where they made all their software capable of running in both modes until there was enough of the new software to remove backwards compatibility.

      Intel just got caught with its pants down as much as anything.

  • (Score: 5, Interesting) by takyon on Tuesday May 14 2019, @08:37PM (1 child)

    by takyon (881) <reversethis-{gro ... s} {ta} {noykat}> on Tuesday May 14 2019, @08:37PM (#843580) Journal

    In addition to what has been said above, there's still room for improvement:

    • At least two more major node shrinks can be done with known transistor designs.
    • Both AMD and Intel are moving to "chiplet" design which keeps yields high while boosting core counts.
    • Stacked DRAM on chips and memory controllers.
    • DRAM integrated into a 3D package with the CPU.
    • Other improvements such as new transistor designs.

    These apply to other processor designs, but they will have the effect of boosting x86 performance, perhaps by orders of magnitude. More free lunch, less talk about breaking compatibility with existing software.

    --
    [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
    • (Score: 2) by RamiK on Tuesday May 14 2019, @09:38PM

      by RamiK (1813) on Tuesday May 14 2019, @09:38PM (#843603)

      More free lunch, less talk about breaking compatibility with existing software.

      Mill doesn't break compatibility with existing software. That's their main selling point over the different capability-based designs out there.

      In addition to what has been said above, there's still room for improvement...

      While I generally agree that x86 will slowly die off over a few decades, similar points were raised to dismiss ARM's dominance in high-end smartphones when Intel introduced their Atoms.

      And IBM is getting serious with OpenPOWER and their PR guys are starting to target consumers now: https://www.youtube.com/watch?v=lt8cu8IMLOM [youtube.com]

      --
      compiling...
  • (Score: 3, Funny) by realDonaldTrump on Tuesday May 14 2019, @08:44PM (1 child)

    by realDonaldTrump (6614) on Tuesday May 14 2019, @08:44PM (#843583) Homepage Journal

    Processor, at one time was very trendy. Anybody who was anyone had one. But, so complicated, always breaking down. We got rid of ours. Went back to Blender!

    • (Score: 3, Funny) by Snow on Tuesday May 14 2019, @09:14PM

      by Snow (1601) on Tuesday May 14 2019, @09:14PM (#843590) Journal

      Yeah, processors suck at making margaritas anyways.

  • (Score: 3, Interesting) by Rich on Tuesday May 14 2019, @09:42PM (3 children)

    by Rich (945) on Tuesday May 14 2019, @09:42PM (#843606) Journal

    A practical proof of such suggestions is easy: have a package that does a clock-accurate simulation (including, say, DDR3 RAM access). In goes a C program, out comes a nanosecond count for an assumed clock rate, or maybe three or four counts for different CPU clocks so a graph can indicate how it scales against external constraints. If they want to stay secretive, it could be a web service. We'd believe them that the hardware representation of their model doesn't exceed a certain fan-out for the short time until they back it up with an FPGA.

    Hobbyists write cycle-accurate arcade emulators these days, so for a company that set out to create an entirely new CPU this should be an easy exercise. I have the gut feeling though (no logical reasoning here) that they'd arrive at a point where their C compiler is going to deliver competitive performance "real soon now", until they fold.
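
    For flavour, the core of such a harness is genuinely small; here is a toy C sketch with made-up per-operation latencies (real DDR3 timing, pipelining and contention are of course far more involved, and none of these numbers describe any real core or DRAM part):

    #include <stdio.h>

    enum op { OP_ALU, OP_MUL, OP_DIV, OP_LOAD_HIT, OP_LOAD_MISS };

    /* Placeholder latencies in cycles, purely illustrative. */
    static const unsigned latency[] = {
        [OP_ALU] = 1, [OP_MUL] = 3, [OP_DIV] = 20,
        [OP_LOAD_HIT] = 4, [OP_LOAD_MISS] = 200,
    };

    int main(void) {
        /* A tiny instruction trace standing in for the compiled C program under test. */
        enum op trace[] = { OP_LOAD_MISS, OP_ALU, OP_MUL, OP_ALU, OP_DIV, OP_ALU };
        unsigned long cycles = 0;

        for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
            cycles += latency[trace[i]];

        /* Report nanoseconds for a few assumed clock rates, as suggested above. */
        for (double ghz = 1.0; ghz <= 4.0; ghz += 1.0)
            printf("%.1f GHz: %lu cycles = %.1f ns\n", ghz, cycles, cycles / ghz);
        return 0;
    }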

    Further remarks:

    - I have the impression that the increasing performance of newer reordering CPUs comes from an increasing ability to optimize suboptimal code. I once wrote a recompiling emulator for the MicroBlaze FPGA CPU (before QEmu had it) and noticed that trying to improve my code output at a very low level during development had very little effect on overall performance. This was on Core 2 Duo & Quad, so these must already have had the ability to optimize at that level in 2005/6. We set up a production system with some high-core-count AMD, which matched the Intels per core on compiled code, but the performance completely tanked with the recompiler. I'd assume that all the dynamic stuff (e.g. Java, CLR, QEmu...) will have a very hard time on an architecture that requires really special compilers.

    - 32-bit x86 might be archaic, but it is still about the most dense representation of program flow around. I'd much prefer 68K myself, but I fear that ye ole 386 still uses a few percent less memory. If the startups can compile C, they also can recompile x86 (possibly even in software), like Transmeta tried.

    - In nerdy smalltalk I generally claim that 100 MHz are technically enough for everything, and beyond that all workloads can get parallelized. (With the exception of crypto factorization). When computers are slow at any sequential task, that used to be due to I/O latencies, and now (since we have SSDs) it is mostly because interpreted code parses through piles of XML poo (or similarly abusive resource wasting).

    • (Score: 2) by RamiK on Tuesday May 14 2019, @10:48PM (2 children)

      by RamiK (1813) on Tuesday May 14 2019, @10:48PM (#843631)

      A practical proof of such suggestions is easy

      WIP [millcomputing.com] by the looks of things.

      increasing ability to optimize suboptimal code...requires really special compilers

      They've ported LLVM (and refactored a lot of it to be able to target their spec-driven ISA) and are busy porting the C++ standard library and a few other libraries, while also writing a testing micro-kernel that can take advantage of the architecture's fat pointers while running real-world open source software. The post from above also mentions they're messing around with micropython while still fixing stuff in the simulator, so they're figuring out other compilers and dynamic stuff too.

      32-bit x86 might be archaic, but it is still about the most dense representation of program flow around

      Thumb is considered about as dense as i386, and I believe one of the RISC-V variants matched that in real-world comparisons.

      If the startups can compile C, they also can recompile x86 (possibly even in software), like Transmeta tried.

      Very likely. After all, Intel is already decoding those x86 instructions on the fly into its own microarchitecture's instructions.

      In nerdy smalltalk I generally claim that 100 MHz are technically enough for everything, and beyond that all workloads can get parallelized.

      I highly doubt this. There were many attempts at generally parallelizing descriptive documents and replacing PostScript even before HTML/JS's XML poo came into place, and nothing really worked.

      --
      compiling...
      • (Score: 2) by Rich on Wednesday May 15 2019, @12:28AM (1 child)

        by Rich (945) on Wednesday May 15 2019, @12:28AM (#843650) Journal

        They seem to count cycles against an x86 simulator, but I guess that's a bit removed from the opaque way a modern desktop CPU really works internally.

        I wonder what workload you imagine that couldn't be handled with 100 MHz at one 32 bit IPC and the option to parallelize out?!

        Postscript rendering itself could be nicely broken up. One CPU per path segment could fill the edge tables and one CPU per scanline can fill the spans with gradients. And if you go for compositing bitmaps, the shapes can be parallelized, too. It's a matter of memory bandwidth, and synchronizing shared access itself is a hard task (and might be so hard that new architectures lose their decisive battle here), but there's no inherent single-thread bottleneck anywhere in there.
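
        A minimal sketch of that scanline split (OpenMP is used here purely for brevity, and a flat horizontal gradient stands in for real edge-table/span logic; build with -fopenmp, or without it the loop just runs serially):

        #include <stddef.h>
        #include <stdint.h>

        /* Each scanline is independent, so rows can be filled on different cores. */
        void fill_gradient(uint8_t *image, int width, int height) {
            #pragma omp parallel for schedule(static)
            for (int y = 0; y < height; y++) {              /* one scanline per task */
                uint8_t *row = image + (size_t)y * (size_t)width;
                for (int x = 0; x < width; x++)             /* span fill */
                    row[x] = (uint8_t)((x * 255) / (width > 1 ? width - 1 : 1));
            }
        }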

        Out of my gut, I'd estimate the "lowest overall effort" sweet spot for a performance-tuned software foundation somewhere around a 7-stage in-order CPU with 4 execution units (complex and simple ALU, load/store, branch) shared between 2 SMT register files (to mitigate stalls wasting area). And then two of those. It could be that the cache required to handle the common workloads needs so much silicon area that making the CPU more complex wouldn't matter, but I guess the cost/performance-sensitive gaming boxes (e.g. Wii) are near my estimate and somewhat validate it. (Also I'm strangely fond of the RK3399's specs, and not just because it has internal kit to do low-latency 16 channel audio...)

        • (Score: 2) by RamiK on Wednesday May 15 2019, @11:21AM

          by RamiK (1813) on Wednesday May 15 2019, @11:21AM (#843776)

          I'd estimate the "lowest overall effort" sweet spot...

          It's pretty close to the most recent MIPS and the in-order ARM cores.

          One CPU per path segment could fill the edge tables and one CPU per scanline can fill the spans with gradients. And if you go for compositing bitmaps, the shapes can be parallelized, too.

          Ghostscript has had multithreaded support since forever, and it didn't help, since the cursor writes are relative to the previous write and linear ( https://ghostscript.com/pipermail/gs-devel/2011-May/008970.html [ghostscript.com] ). TeX might have a better chance due to its box model, but I think Knuth would have done as much by now if it were possible.

          There was an unrelated essay from HP Labs back in the day about instruction-level parallelism that evaluated PostScript and spelled out all the issues and why it's a good candidate... Can't find it though.

          It's a matter of memory bandwidth, and synchronizing shared access itself is a hard task

          That's like saying the only reason we can't run fast enough is friction. Locks and cycles lost to IPC on the one hand, and the lacking width and depth of the branch predictor on the other, represent real hardware limits that are at the core of the issue, much like how slow the cache is. Intel hit the depth ceiling with the Pentium 4 and is now hitting it again with its more recent cores. They even managed to expand to 5-wide with certain instructions using some crazy voodoo I can't follow... Regardless, the solution space for all of this following the failed Itanium is the topic under discussion. Mill especially exposes enough of the pipelines and goes wide enough precisely to address instruction-level parallelism. They're even talking about zero-cost hardware-thread IPC, which is basically why they're writing their own kernel... But there are some drawbacks raised in the article that leave room for different designs, which in a way is actually reassuring, since it means we'll be seeing different parties coming up with solutions in the next few years.

          Also I'm strangely fond of the RK3399's specs, and not just because it has internal kit to do low-latency 16 channel audio...

          I don't get why everyone likes those Cortex-A72s so much. I liked the original RasPis well enough when they came out, for the simple stuff they did well. But at these frequencies I'm WAY too lazy to optimize around 3-wide to save $5 of BoM for no real performance/power gains. I'll pick up a Cortex-A76 SBC/laptop when they get cheap enough. And yeah, it will have a nice DSP I'll abuse for guitar effects most likely :D

          --
          compiling...
  • (Score: 0) by Anonymous Coward on Wednesday May 15 2019, @12:10PM

    by Anonymous Coward on Wednesday May 15 2019, @12:10PM (#843793)

    20 years ago there was a new buffer overflow vulnerability just about every week. The solution was not to get rid of buffers, or switch everything to fixed size buffers, or even to switch to bounds checked languages (although many programmers did that anyway). The solution was just to stop writing broken buffer code.

    We are now seeing a situation where hardware designers have made the same mistake in lots of different places, just like software designers did back then. It is annoying right now, and soon they will stop making this particular mistake, and then it will go away, just like buffer overflows, capacitor plague, boot sector viruses, or whatever the problem du jour was.

    And then a new mistake will turn up everywhere and we will go through it again, people will proclaim the sky is falling, engineers will fix things, and life will go on. Again.

  • (Score: 2) by fyngyrz on Wednesday May 15 2019, @02:44PM

    by fyngyrz (6567) on Wednesday May 15 2019, @02:44PM (#843839) Journal

    If companies would invest in competent programmers and give them enough resources to work in efficient languages, applications would be a fraction of the size they typically are today and run much faster.

    Then apps wouldn't need to be pushing the processors so hard, and a simpler architecture would suffice in most cases.

    Find yourself a good C programmer. You'd be amazed at what can be accomplished with a relatively vanilla architecture.

    --
    Knowledge is strength. Unless the opposition has more money.
