
posted by janrinok on Wednesday January 04, @10:05AM

https://www.righto.com/2023/01/inside-8086-processors-instruction.html

The groundbreaking 8086 microprocessor was introduced by Intel in 1978 and led to the x86 architecture that still dominates desktop and server computing. One way that the 8086 increased performance was by prefetching: the processor fetches instructions from memory before they are needed, so the processor can execute them without waiting on the (relatively slow) memory. I've been reverse-engineering the 8086 from die photos and this blog post discusses what I've uncovered about the prefetch circuitry.

The 8086 was introduced at an interesting point in microprocessor history, where memory was becoming slower than the CPU. For the first microprocessors, the speed of the CPU and the speed of memory were comparable. However, as processors became faster, the speed of memory failed to keep up. The 8086 was probably the first microprocessor to prefetch instructions to improve performance. While modern microprocessors have megabytes of fast cache to act as a buffer between the CPU and much-slower main memory, the 8086 has just 6 bytes of prefetch queue. However, this was enough to increase performance by about 50%.

Prefetching had a major impact on the design of the 8086. Earlier processors such as the 6502, 8080, or Z80 were deterministic. The processor fetched an instruction, executed the instruction, fetched the next instruction, and so forth. Memory accesses corresponded directly to instruction fetching and execution and instructions took a predictable number of clock cycles. This all changed with the introduction of the prefetch queue. Memory operations became unlinked from instruction execution since prefetches happen as needed and when the memory bus is available.
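
To make that concrete, here is a minimal C sketch of an 8086-style prefetch queue. It is a toy model, not Intel's actual circuitry: the bus interface unit (BIU) fills a six-byte FIFO whenever the bus is idle, the execution unit (EU) drains it, and a jump throws away whatever was fetched ahead.

  /* Toy model of a 6-byte prefetch queue (illustration, not Intel's logic). */
  #include <stdio.h>

  #define QUEUE_SIZE 6                 /* the 8086 queue holds 6 bytes */

  static unsigned char queue[QUEUE_SIZE];
  static int count = 0;                /* bytes currently buffered */
  static unsigned prefetch_addr = 0;   /* next address the BIU will fetch */

  /* Run on bus-idle cycles: fetch ahead if there is room in the queue. */
  static void biu_prefetch(void)
  {
      if (count < QUEUE_SIZE)
          queue[count++] = (unsigned char)(prefetch_addr++ & 0xFF); /* fake memory */
  }

  /* EU asks for the next instruction byte; returns -1 if it must stall. */
  static int eu_next_byte(void)
  {
      if (count == 0)
          return -1;                   /* queue empty: the EU waits on memory */
      int b = queue[0];
      for (int i = 1; i < count; i++)  /* shift the FIFO down */
          queue[i - 1] = queue[i];
      count--;
      return b;
  }

  /* A jump discards everything that was prefetched past the old location. */
  static void jump_to(unsigned target)
  {
      count = 0;
      prefetch_addr = target;
  }

  int main(void)
  {
      for (int i = 0; i < 4; i++)
          biu_prefetch();                               /* idle bus: queue fills up */
      printf("first byte from queue: %d\n", eu_next_byte());
      jump_to(0x100);                                   /* jump: prefetched bytes are wasted */
      printf("bytes left after jump: %d\n", count);     /* prints 0 */
      return 0;
  }

Once fetching and execution are decoupled like this, how long an instruction takes depends on how full the queue happens to be at that moment, which is why memory traffic no longer lines up one-to-one with instruction execution.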



  • (Score: 4, Interesting) by Snotnose on Wednesday January 04, @03:20PM (3 children)

    by Snotnose (1623) on Wednesday January 04, @03:20PM (#1285120)

    In the early 80s I was writing 8086 assembly. I knew how many clock cycles each instruction took to execute, and how long it took to fetch data from memory. I wrote a lot of spaghetti code where I would fetch from memory long before I needed it, as long as I had a register I could keep it in until needed.

    Was my code faster? Probably. Was it easy to maintain? Uh, no. Was it worth it? In hindsight, no. But I was self taught and new at all this.

    Ask me how many idle loops I optimized before I learned better....
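
    The trick, rendered roughly in C (made-up names; the real thing would have been 8086 assembly parking the value in a spare register until needed):

      /* Rough C rendering of the "fetch it early" idea; names are illustrative. */
      int scale_sum(const int *table, int index, const int *work, int n)
      {
          int scale = table[index];     /* load from memory long before it's used */

          int sum = 0;
          for (int i = 0; i < n; i++)   /* long-running work hides the memory access */
              sum += work[i];

          return sum * scale;           /* the early-loaded value is finally consumed */
      }

    A modern compiler will happily reorder loads like this on its own, which is part of why the hand-scheduling stopped being worth the maintenance pain.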

    --
    I just passed a drug test. My dealer has some explaining to do.
    • (Score: 3, Interesting) by meisterister on Wednesday January 04, @05:10PM (1 child)

      by meisterister (949) on Wednesday January 04, @05:10PM (#1285130) Journal

      I knew how many clock cycles each instruction took to execute, and how long it took to fetch data from memory.

      And you really do need to know both! Intel's instruction timings assumed that the instruction was already prefetched. On the 8086, it probably was. On the 8088... things get tricky. It can read an entire byte per four(?) cycles if you're lucky, so a chain of fast instructions can cause the processor to choke on memory access.

      Michael Abrash wrote an excellent explanation of 8086/8 access behavior in the Graphics Programming Black Book (https://www.jagregory.com/abrash-black-book/#chapter-4-in-the-lair-of-the-cycle-eaters [jagregory.com]).
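
      A rough worked example of the effect, assuming the commonly cited figure of about 4 bus cycles per byte fetched on the 8088 (illustrative numbers, not a cycle-exact model or datasheet values):

        /* Back-of-the-envelope check for whether a run of instructions is
         * bus-bound on the 8088. Assumes ~4 bus cycles per instruction byte;
         * the instruction figures below are illustrative. */
        #include <stdio.h>

        int main(void)
        {
            const int cycles_per_byte = 4;   /* 8088: one byte per bus access */

            int instr_bytes = 2;             /* e.g. a short register-to-register op */
            int listed_exec = 3;             /* the timing printed in the manual */

            int fetch_cost = instr_bytes * cycles_per_byte;            /* 8 cycles of bus time */
            int effective  = fetch_cost > listed_exec ? fetch_cost : listed_exec;

            printf("listed %d cycles, bus needs %d, effective %d per instruction\n",
                   listed_exec, fetch_cost, effective);
            /* The EU is done in 3 cycles, but the BIU needs 8 just to replace the
               bytes it consumed, so a chain of these ends up limited by the bus. */
            return 0;
        }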

      --
      (May or may not have been) Posted from my K6-2, Athlon XP, or Pentium I/II/III.
      • (Score: 3, Interesting) by Snotnose on Wednesday January 04, @09:38PM

        by Snotnose (1623) on Wednesday January 04, @09:38PM (#1285181)

        That link is a very interesting stroll down history road. I think I had his Zen book on optimizing the 486 from '94 to '96, and quickly realized we were I/O bound, not CPU bound. I found that out by buying a (very expensive at the time) Pentium PC and seeing no performance increase.

        I was lucky in that I came into things from the hardware side. I was a tech who wrote programs to troubleshoot hardware problems (we're talking very early 80s here). Engineering found out, asked "hey, wanna make a lot more money?", to which I said "yes please". So I was very aware of bus timing issues, as well as DRAM refresh. I was not programming an 8088 (although I had one at home), nor did our system have the display cycle stealers. But I was most definitely aware of how the EU and the BIU worked together, along with the underlying hardware.

        It was '91 or so when I couldn't get hand crafted assembly code to outpace the compiler output. It was for a fax machine on a 16032 CPU (later renamed 32016 for marketing reasons). I had x ms to scan a line and I just couldn't git 'r done. We ended up with more expensive HW running at a faster clock rate.

        --
        I just passed a drug test. My dealer has some explaining to do.
    • (Score: 2) by kazzie on Wednesday January 04, @06:03PM

      by kazzie (5309) Subscriber Badge on Wednesday January 04, @06:03PM (#1285138)

      I've spent the last week getting acquainted with the 6507 (6502) in the Atari 2600, and the cycle counting required to make sure that you got everything done in time for the next scanline. Throwing prefetched instructions and queue flushes in too would have made for a right ol' headache...

  • (Score: 2) by bzipitidoo on Wednesday January 04, @05:17PM (3 children)

    by bzipitidoo (4388) Subscriber Badge on Wednesday January 04, @05:17PM (#1285131) Journal

    Huh, didn't know prefetch was done that far back. I've never been clear on what the differences were between the 8088 and the 8086, just a vague idea that despite being assigned a lower number, the 8086 was the more advanced processor. You'd think the 8088 was an 8086+2, some sort of enhanced 8086. Was prefetch one of the differences?

    In the late 1980s, pipelining was the hot thing to do to speed up computers. Prefetch lots of instructions, have 4 or more instructions in the pipeline. No doubt someone thought of speculative execution then, but I'd guess that prior to the mid 1990s, limits on circuitry likely meant it couldn't be implemented at a reasonable cost. As I recall, the Pentium (the first one) had branch prediction, which they goofed up in the first Pentiums, having it misread the prediction to take the less likely branch rather than the more likely one. That bug didn't stop the chip from working correctly, it just wasn't quite as fast as it could have been. The infamous FDIV bug on the other hand, yeah.

    • (Score: 3, Informative) by turgid on Wednesday January 04, @05:51PM

      by turgid (4318) Subscriber Badge on Wednesday January 04, @05:51PM (#1285135) Journal

      The 8088 was the same as the 8086 internally, but externally it had a multiplexed 8-bit data bus (it was 16 bits on the 8086), so it was smaller and cheaper, and cheaper to make motherboards for. There was also an 80186 and an 80188. Later, after the 386 came out, they made a version with a 16-bit external bus called the 386SX.

    • (Score: 3, Informative) by owl on Wednesday January 04, @06:12PM (1 child)

      by owl (15206) Subscriber Badge on Wednesday January 04, @06:12PM (#1285140)

      No doubt someone thought of speculative execution then

      You'll find, if you really dig around in computer CPU history, that almost all of the "techniques" that were hyped as "new and improved" for a current (at the time) Intel CPU were already done, sometimes, long before, for other computer systems. Speculative execution dates back to at least 1967 and Tomasulo's algorithm [wikipedia.org] for the IBM System/360 Model 91 mainframe.

      For almost everything Intel's ever added to an x86 CPU, there was an equivalent that had been done, often many years before, on a mini or a mainframe CPU. The difference often amounted to whatever 'tweaks' were needed to apply the technique to the x86 instruction set, but the underlying method was just Intel finally having enough transistors to build in yet another feature someone else invented elsewhere.

      limits on circuitry likely meant it couldn't be implemented at a reasonable cost.

      Yup, the techniques flowed down into the Intel CPUs as the transistor count and cost budgets of Intel's chips allowed. But in areas where there was much more 'money' to go around (mainframes), the techniques existed long before Intel added them to x86.

      • (Score: 3, Interesting) by bzipitidoo on Wednesday January 04, @08:44PM

        by bzipitidoo (4388) Subscriber Badge on Wednesday January 04, @08:44PM (#1285162) Journal

        Ah, yes, been years since my undergrad class on computer architecture. The professor often said that the techniques used in mainframes eventually came to personal computers, with a lag of about a decade.

        One other thing was a textbook, from a graduate course on architecture, with an appendix in which they ripped Intel's x86 architecture a new one. Said it was horrible, and listed a bunch of reasons why: too few general-purpose registers, lots of register-specific instructions, the segmented memory model, a load/execute/store model for integer math but a stack-based model for floating point in the 8087, and CISC in general.

  • (Score: 2) by ElizabethGreene on Wednesday January 04, @05:58PM (5 children)

    by ElizabethGreene (6748) on Wednesday January 04, @05:58PM (#1285136)

    According to TFA the prefetch cache is discarded when the instruction pointer is set to an arbitrary value, e.g. for a jmp. An interesting side effect of this would be that an app with a bunch of flat code should be more performant on this hardware* vs. an app that was coded with a large number of function calls.

    That explains why they made such a big deal about execution prediction and the precache on the Pentium. Interesting!
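
    As a sketch of the structural difference (plain C, illustrative only): both functions below do the same work, but on a queue-flushing CPU like the 8086, every call and every ret in the first one is a control transfer that throws away the prefetched bytes.

      /* Same work written call-heavy vs. flat. The first version pays for a
       * prefetch-queue refill on each call and each return; the second avoids
       * those transfers (the loop's own branch still costs either way). */
      static int square(int x) { return x * x; }

      int sum_squares_calls(const int *v, int n)
      {
          int s = 0;
          for (int i = 0; i < n; i++)
              s += square(v[i]);        /* call + ret per element */
          return s;
      }

      int sum_squares_flat(const int *v, int n)
      {
          int s = 0;
          for (int i = 0; i < n; i++)
              s += v[i] * v[i];         /* same arithmetic, no call/ret */
          return s;
      }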

    • (Score: 3, Insightful) by owl on Wednesday January 04, @07:10PM (4 children)

      by owl (15206) Subscriber Badge on Wednesday January 04, @07:10PM (#1285146)

      An interesting side effect of this would be that an app with a bunch of flat code should be more performant on this hardware* vs. an app that was coded with a large number of function calls.

      That 'side effect' is true for any CPU architecture. A flat sequence of code, without any function calls or branches, and assuming a memory fast enough to keep the CPU supplied with instructions, will be faster than the equivalent with a lot of function calls and branches.

      Branches and function calls add overhead, and not all of that overhead can be hidden. Branch prediction attempts to hide most of the overhead for branches, but even so there is still more overhead compared to a stream of flat code. Function calls always add some overhead that can't be fully hidden, because a call saves the state of the CPU at the call point along with the return address, and then restores that state upon returning (at a minimum, the registers the function will use). This is why modern compilers do function in-lining and loop unrolling: both let the programmer write nice logical loops and function calls while keeping the generated code closer to a stream of flat code.

      However, there is a point of diminishing returns for real-world systems. Memory speeds are nowhere near fast enough to keep a CPU fully supplied with instructions in a flat, no-branch, no-function-call stream of code. So for real systems, the optimum usually sits somewhere just before the flat stream of code becomes too large to fit into the L1 instruction cache.
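
      A sketch of what the in-lining half of that looks like at the source level (the keyword is only a hint; optimizing compilers do this on their own at typical optimization levels):

        /* Function in-lining from the source side: with optimization on, the
         * compiler expands the helper's body at the call site, so there is no
         * call, no return, and no register save/restore in the generated code. */
        static inline int clamp(int x, int lo, int hi)
        {
            return x < lo ? lo : (x > hi ? hi : x);
        }

        int clamp_all(int *v, int n)
        {
            int changed = 0;
            for (int i = 0; i < n; i++) {
                int c = clamp(v[i], 0, 255);   /* expanded in place: a straighter
                                                  instruction stream for the fetcher */
                changed += (c != v[i]);
                v[i] = c;
            }
            return changed;
        }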

      • (Score: 2) by ChrisMaple on Wednesday January 04, @08:02PM (2 children)

        by ChrisMaple (6964) on Wednesday January 04, @08:02PM (#1285156)

        In my experience, register saving/restoring is done by the called function because it knows what registers it corrupts.

        • (Score: 3, Insightful) by owl on Wednesday January 04, @08:56PM

          by owl (15206) Subscriber Badge on Wednesday January 04, @08:56PM (#1285166)

          No matter where the state save occurs, it is a necessary overhead to making function calls -- which was the point.

        • (Score: 1) by lars_stefan_axelsson on Friday January 06, @12:45PM

          by lars_stefan_axelsson (3590) on Friday January 06, @12:45PM (#1285459)

          That's the advantage. The disadvantage is that the callee then may save/restore registers that the caller doesn't care about. Whichever way you do it there are drawbacks. (And as mentioned below, register windows were an interesting approach. But it came with its own disadvantages.)

          --
          Stefan Axelsson
      • (Score: 2) by turgid on Wednesday January 04, @08:06PM

        by turgid (4318) Subscriber Badge on Wednesday January 04, @08:06PM (#1285157) Journal

        Register windows were an interesting attempt to mitigate some of the function call overhead.
