
posted by martyb on Saturday October 31 2015, @03:33AM
from the A|Journey|Through|the|CPU|Pipeline dept.

It is good for programmers to understand what goes on inside a processor. The CPU is at the heart of our careers.

What goes on inside the CPU? How long does it take for one instruction to run? What does it mean when a new CPU has a 12-stage pipeline, an 18-stage pipeline, or even a "deep" 31-stage pipeline?

Programs generally treat the CPU as a black box. Instructions go into the box in order, instructions come out of the box in order, and some processing magic happens inside.

As a programmer, you will find it useful to learn what happens inside the box. This is especially true if you work on tasks like program optimization: if you don't know what is going on inside the CPU, how can you optimize for it?
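
To make the pipeline question concrete, here is a minimal sketch (the function names are illustrative, not from the article): a sum with a single accumulator serializes on the latency of each add, while independent accumulators let the adds overlap inside the pipeline.

    #include <cstddef>
    #include <vector>

    // One accumulator: every add depends on the previous result, so the
    // loop runs at the latency of a floating-point add.
    double sum_serial(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s;
    }

    // Four independent accumulators: the adds overlap in flight, so the
    // loop approaches the CPU's add throughput instead.
    double sum_pipelined(const std::vector<double>& v) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0, n = v.size();
        for (; i + 4 <= n; i += 4) {
            s0 += v[i];     s1 += v[i + 1];
            s2 += v[i + 2]; s3 += v[i + 3];
        }
        for (; i < n; ++i) s0 += v[i];
        return (s0 + s1) + (s2 + s3);
    }

(Compilers often will not perform this transformation on their own, since floating-point addition is not associative; that is exactly the kind of detail that knowing the pipeline makes visible.)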

A primer for those with a less formal background.


Original Submission

 
  • (Score: 2) by Rich (945) on Saturday October 31 2015, @05:52AM (#256810) Journal

    Modern superscalar CPUs seem to embody most of the knowledge a solid assembly programmer is able to put into an average project. Case in point: I wrote a dynamic recompiler that translates an exotic MIPS-like CPU to x86 so we could run our tests in real time on the build rig instead of resorting to a cluster of actual hardware. Once the basic issues were sorted (e.g. streamlining the memory mapping), I found that further efforts to make the compiled code more efficient had diminishing returns.
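
    For the curious, a minimal sketch of the dispatch loop such a recompiler needs (hypothetical names, not our actual code; translate_block() stands in for the real x86 emitter). Each guest basic block is translated once, cached, and from then on executed directly:

        #include <cstdint>
        #include <cstdio>
        #include <unordered_map>

        // A translated block is host (x86) code that executes some guest
        // instructions and returns the next guest program counter.
        using HostBlock = uint32_t (*)();

        const uint32_t HALT_PC = 0xFFFFFFFFu;  // sentinel: stop emulation

        // Stand-in for the real emitter: the actual recompiler decodes
        // guest instructions here and emits x86 machine code into a buffer.
        HostBlock translate_block(uint32_t guest_pc) {
            std::printf("translating guest block at 0x%08x\n",
                        static_cast<unsigned>(guest_pc));
            return []() -> uint32_t { return HALT_PC; };  // dummy "halt" block
        }

        void run(uint32_t pc) {
            std::unordered_map<uint32_t, HostBlock> cache;  // translation cache
            while (pc != HALT_PC) {
                auto it = cache.find(pc);
                if (it == cache.end())
                    it = cache.emplace(pc, translate_block(pc)).first;
                pc = it->second();  // jump into the translated code
            }
        }

        int main() { run(0); }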

    We didn't have the time to investigate deeply, but I assume that when you have a wide pipeline with a lot of execution units, there is almost always an idle slot left in which to stick whatever inefficient instructions come along. Much the same goes for array bounds checking, by the way. The check itself will usually be parallelized (without any hazards affecting the rest of the pipeline), and the branch predictor soon learns that the failure branch will never be taken (and if it ever is, one 30-cycle stall is the least of your problems).
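
    A tiny illustration of the bounds-checking point (hypothetical names): with vector::at() the check is a compare plus a branch that is essentially never taken, so the predictor learns it almost immediately and the loads proceed unhindered.

        #include <cstddef>
        #include <vector>

        // Sum v at the given indices with checked accesses.  The compare
        // inside at() runs in parallel with the load on a superscalar core,
        // and the out-of-bounds branch is never taken here, so after a few
        // iterations the predictor treats the check as free.  On an actual
        // violation, at() throws std::out_of_range.
        double checked_sum(const std::vector<double>& v,
                           const std::vector<std::size_t>& idx) {
            double s = 0.0;
            for (std::size_t i : idx)
                s += v.at(i);  // bounds-checked access
            return s;
        }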

    It might be that compiler backend writers want deep knowledge here, not only to squeeze out the last few percent but also to avoid running into flaws that certain CPU implementations might have. But that is sufficiently advanced to appear like magic even to solid assembly coders. One thing to remember, however, is that going out of cache to main RAM costs dearly. Funnily enough, that perverts the C++ template attitude which says that everything must be done without overhead: it will be more efficient to have a single routine that processes an "overhead" dynamic element size (parallelized) and goes through handlers (branches predicted) than to have multiple specializations that each handle the data in a "perfect" way, because the latter thrashes the caches (the i-caches at least).
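
    To illustrate that trade-off (hypothetical names): the templated copy below gets one perfectly specialized loop per element size, each its own blob of machine code, while the generic version pays for a dynamic size but exists exactly once and stays hot in the i-cache.

        #include <cstddef>
        #include <cstring>

        // Template route: one specialization per element size N.  Each
        // instantiation is a separate copy of the loop; with many sizes in
        // play, the copies compete for instruction cache.
        template <std::size_t N>
        void copy_elems_fixed(char* dst, const char* src, std::size_t count) {
            for (std::size_t i = 0; i < count; ++i)
                std::memcpy(dst + i * N, src + i * N, N);  // N fixed at compile time
        }

        // Generic route: one routine handles every element size.  The extra
        // parameter looks like overhead, but there is exactly one copy of
        // this code, so it stays resident across all call sites.
        void copy_elems(char* dst, const char* src, std::size_t count,
                        std::size_t elem_size) {
            for (std::size_t i = 0; i < count; ++i)
                std::memcpy(dst + i * elem_size, src + i * elem_size, elem_size);
        }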
