Stories
Slash Boxes
Comments

SoylentNews is people

posted by janrinok on Friday March 24, @06:33PM   Printer-friendly

http://www.righto.com/2023/03/8086-multiplication-microcode.html

While programmers today take multiplication for granted, most microprocessors in the 1970s could only add and subtract — multiplication required a slow and tedious loop implemented in assembly code. One of the nice features of the Intel 8086 processor (1978) was that it provided machine instructions for multiplication,2 able to multiply 8-bit or 16-bit numbers with a single instruction. Internally, the 8086 still performed a loop, but the loop was implemented in microcode: faster and transparent to the programmer. Even so, multiplication was a slow operation, about 24 to 30 times slower than addition.

In this blog post, I explain the multiplication process inside the 8086, analyze the microcode that it used, and discuss the hardware circuitry that helped it out.3 My analysis is based on reverse-engineering the 8086 from die photos. The die photo below shows the chip under a microscope. I've labeled the key functional blocks; the ones that are important to this post are darker. At the left, the ALU (Arithmetic/Logic Unit) performs the arithmetic operations at the heart of multiplication: addition and shifts. Multiplication also uses a few other hardware features: the X register, the F1 flag, and a loop counter.


Original Submission

This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
(1)
  • (Score: 3, Interesting) by guest reader on Friday March 24, @09:46PM (1 child)

    by guest reader (26132) Subscriber Badge on Friday March 24, @09:46PM (#1298053)

    It seems that multiplication is now as fast or even faster (in real arithmetic) than addition.

    From this article in "Later advances in multiplication":

    - The 8086 took up to 133 clock cycles to multiply unsigned 16-bit values.
    - By 1982, the Intel 286 processor cut this time down to 21 clock cycles.
    - The Intel 486 (1989) used an improved algorithm that could end early, so multiplying by a small number could take just 9 cycles.
    - The Cyrix Cx486SLC (1992) had a 16-bit hardware multiplier that cut word multiply down to 3 cycles.
    - The Intel Core 2 (2006) was even faster, able to complete a 32-bit multiplication every clock cycle.

    • (Score: 3, Interesting) by owl on Saturday March 25, @01:57AM

      by owl (15206) Subscriber Badge on Saturday March 25, @01:57AM (#1298081)

      The Intel Core 2 (2006) was even faster, able to complete a 32-bit multiplication every clock cycle.

      Brought about by the much larger transistor budget available. Just the hardware multiplier circuit for a Core 2 CPU is likely more transistors that the total transistor count quoted for the 8086 (29,000 transistors).

      Lots of performance can often be found, provided there are enough transistors available to build the necessary hardware.

  • (Score: 3, Interesting) by stormwyrm on Saturday March 25, @07:14AM

    by stormwyrm (717) on Saturday March 25, @07:14AM (#1298098) Journal

    It seems they did almost the whole thing in microcode, which had performance advantages over programming a shift and add multiplier in x86 code only because loading instructions from memory is inherently slower than microcode. A real hardware multiplier circuit in the ALU was likely unfeasible with integrated circuit technology as it existed in the late 1970s (I think a full 16-bit multiplier would need more transistors than the entire x86 processor itself). I used to optimise multiplications by small constants by adding shifted copies of the multiplier based on the 1-bits in the multiplicand: this could be much faster for reasonably small constant values, e.g. one would multiply AX by 3 by doing:

    MOV BX,AX
    SHL BX,1
    ADD AX,BX

    This works because 3 = 2 (one shift) + 1 (no shift). Or by 6 = (0b110), so one shift (×2) and two shifts (×4):

    MOV BX,AX
    SHL BX,1
    SHL BX,1
    SHL AX,1
    ADD AX,BX

    You'd need to watch out for addition carries and overflow with the shifts here though, and the original 8086 instruction set didn't include a shift by more than one constant bit (you'd need to load the shift count into CL). The speedup was especially pronounced when there were only a few 1-bits in the multiplicand. I reserved the expensive MUL/IMUL instruction for multiplications by non-constants or multiplying by larger numbers.

    --
    Numquam ponenda est pluralitas sine necessitate.
  • (Score: 2) by Rich on Saturday March 25, @08:55PM

    by Rich (945) on Saturday March 25, @08:55PM (#1298147) Journal

    The article explains that newer CPUs use hardware multipliers modeled after "Wallace trees" or the "Dadda multiplier". I've never put much thought into it, but I was unter the (now seemingly wrong) assumption that hardware multiplication is a "byproduct" of the barrel shifter the more modern CPUs have where the barrel shifter has a bit of additional hardware to sum simultaneous shifts from the first factor, selected by the second factor.

(1)