http://www.righto.com/2023/03/8086-multiplication-microcode.html
While programmers today take multiplication for granted, most microprocessors in the 1970s could only add and subtract — multiplication required a slow and tedious loop implemented in assembly code. One of the nice features of the Intel 8086 processor (1978) was that it provided machine instructions for multiplication, able to multiply 8-bit or 16-bit numbers with a single instruction. Internally, the 8086 still performed a loop, but the loop was implemented in microcode: faster and transparent to the programmer. Even so, multiplication was a slow operation, about 24 to 30 times slower than addition.
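To make the contrast concrete, here is a minimal sketch of such a shift-and-add loop in 8086 assembly (the register choices are illustrative, not the article's; the microcoded version uses internal registers and helper hardware instead). It multiplies the unsigned byte in BL by the one in DL, leaving the 16-bit product in AX:

        ; Unsigned 8-bit multiply by shift-and-add: BL * DL -> AX.
        ; Illustrative sketch only; the 8086 microcode does the
        ; equivalent internally with its own loop counter.
        XOR     AX, AX          ; product = 0
        MOV     BH, 0           ; BX = multiplicand, zero-extended
        MOV     CX, 8           ; 8 multiplier bits to examine
next:   SHR     DL, 1           ; shift low multiplier bit into carry
        JNC     skip            ; bit was 0: skip the add
        ADD     AX, BX          ; bit was 1: add the shifted multiplicand
skip:   SHL     BX, 1           ; double the multiplicand for the next bit
        LOOP    next            ; decrement CX and repeat while nonzero
        ; AX now holds BL * DL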
In this blog post, I explain the multiplication process inside the 8086, analyze the microcode that it used, and discuss the hardware circuitry that helped it out. My analysis is based on reverse-engineering the 8086 from die photos. The die photo below shows the chip under a microscope. I've labeled the key functional blocks; the ones that are important to this post are darker. At the left, the ALU (Arithmetic/Logic Unit) performs the arithmetic operations at the heart of multiplication: addition and shifts. Multiplication also uses a few other hardware features: the X register, the F1 flag, and a loop counter.
(Score: 3, Interesting) by guest reader on Friday March 24, @09:46PM (1 child)
It seems that multiplication is now as fast as or even faster than addition (in real arithmetic).
From the "Later advances in multiplication" section of the article:
(Score: 3, Interesting) by owl on Saturday March 25, @01:57AM
Brought about by the much larger transistor budget available. The hardware multiplier circuit for a Core 2 CPU alone likely contains more transistors than the total transistor count quoted for the 8086 (29,000 transistors).
Lots of performance can often be found, provided there are enough transistors available to build the necessary hardware.
(Score: 3, Interesting) by stormwyrm on Saturday March 25, @07:14AM
It seems they did almost the whole thing in microcode, which had a performance advantage over a shift-and-add multiplier written in x86 code only because fetching instructions from memory is inherently slower than executing microcode. A real hardware multiplier circuit in the ALU was likely infeasible with integrated-circuit technology as it existed in the late 1970s (I think a full 16-bit multiplier would need more transistors than the entire 8086 itself). I used to optimise multiplication by small constants by adding shifted copies of the value, one for each 1-bit in the constant: this could be much faster for reasonably small constants, e.g. one would multiply AX by 3 with something like this (using BX as scratch):
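        MOV     BX, AX          ; BX = x (assumes BX is free as scratch)
        SHL     AX, 1           ; AX = 2*x
        ADD     AX, BX          ; AX = 2*x + x = 3*x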
This works because 3 = 2 (one shift) + 1 (no shift). Similarly, 6 = 0b110, so you add the copy shifted once (×2) to the copy shifted twice (×4):
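        SHL     AX, 1           ; AX = 2*x
        MOV     BX, AX          ; keep the 2*x copy (again using BX as scratch)
        SHL     AX, 1           ; AX = 4*x
        ADD     AX, BX          ; AX = 4*x + 2*x = 6*x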
You'd need to watch out for carries and overflow with the shifts here, though, and the original 8086 instruction set had no shift by a constant count other than 1 (for larger counts you had to load the shift count into CL). The speedup was especially pronounced when the constant had only a few 1-bits. I reserved the expensive MUL/IMUL instructions for multiplication by non-constants or by larger numbers.
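For instance, scaling by 16 on a stock 8086 meant either four single-bit shifts in a row or going through CL:

        MOV     CL, 4           ; variable shift counts must live in CL on the 8086
        SHL     AX, CL          ; AX = 16*x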
Numquam ponenda est pluralitas sine necessitate.
(Score: 2) by Rich on Saturday March 25, @08:55PM
The article explains that newer CPUs use hardware multipliers modeled after "Wallace trees" or the "Dadda multiplier". I'd never put much thought into it, but I was under the (now seemingly wrong) assumption that hardware multiplication is a "byproduct" of the barrel shifter that more modern CPUs have, with the barrel shifter getting a bit of additional hardware to sum simultaneous shifts of the first factor, selected by the second factor.