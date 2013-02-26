https://nand2mario.github.io/posts/2026/80386_barrel_shifter/
I'm currently building an 80386-compatible core in SystemVerilog, driven by the original Intel microcode extracted from real 386 silicon. Real mode is now operational in simulation, with more than 10,000 single-instruction test cases passing, and work on protected-mode features is in progress. Along the way I've examined corners of the 386 microcode and silicon in detail; this series documents what I found.
In the previous post, we looked at multiplication and division -- iterative algorithms that process one bit per cycle. Shifts and rotates are a different story: the 386 has a dedicated barrel shifter that completes an arbitrary multi-bit shift in a single cycle. What's interesting is how the microcode makes one piece of hardware serve all shift and rotate variants -- and how the complex rotate-through-carry instructions are handled.
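To make "one piece of hardware serves all shift and rotate variants" concrete, here is a minimal Python sketch of one common way to do it: a funnel shifter that picks a 32-bit window out of a 64-bit input. This is an illustration under assumptions, not the 386's actual microcode or datapath; the function names are hypothetical, flags are ignored, and the rotate-through-carry instructions (RCL/RCR) are not modelled.

```python
MASK32 = 0xFFFFFFFF

def funnel_shift_right(hi, lo, count):
    """Low 32 bits of the 64-bit value hi:lo shifted right by count (0..31)."""
    combined = ((hi & MASK32) << 32) | (lo & MASK32)
    return (combined >> (count & 31)) & MASK32

def shr(x, n):
    # Logical right shift: zeros enter from the upper word.
    return funnel_shift_right(0, x, n)

def sar(x, n):
    # Arithmetic right shift: copies of the sign bit enter from the upper word.
    sign_fill = MASK32 if x & 0x80000000 else 0
    return funnel_shift_right(sign_fill, x, n)

def ror(x, n):
    # Rotate right: the same value feeds both halves.
    return funnel_shift_right(x, x, n)

def shl(x, n):
    # Left shift by n is a right funnel shift by 32 - n with zeros below.
    n &= 31
    return x & MASK32 if n == 0 else funnel_shift_right(x, 0, 32 - n)

def rol(x, n):
    # Rotate left by n is a rotate right by 32 - n.
    n &= 31
    return x & MASK32 if n == 0 else funnel_shift_right(x, x, 32 - n)
```

In this sketch the only things that change between instructions are what gets loaded into the two halves and whether the count is n or 32 - n, which is the kind of setup work that can be done before the single-cycle shift itself.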
Anonymous Coward on Friday February 13, @11:57PM
How does the internal microcode not take up cycles? You can do a barrel-shift in one cycle, that includes microcode read, parse, process, copy values into the correct registers, and then shift them.
But how is all that prep - free? How is the "execution" of the microcode free? I could see if things were pipelined, but then I'd expect the overall instruction to take multiple cycles, not just one.
I haven't finished the article yet, maybe this is covered. :-)
ChrisMapla on Saturday February 14, @02:49AM
The article cites the Intel data book, which says many of these instructions take 3 cycles. The actual data movement may be one cycle, but the microcode takes more.
owl on Saturday February 14, @03:11AM
Also, for the AC's edification, the instruction cycle count for the 386, for shifting arithmetic right (just picked as a random shift example) by any amount from 1 to 31, was 3 clocks (if done in a register).
Contrast that with a 286, which took 5+n (with n being the number of bits of shift) for the same instruction, or an 8086 which took 8+4*n (same n).
So for, say, a 30-bit shift, the 386 would perform this in 3 clock ticks, the 286 in 35 clock ticks, and an 8086 in 128 clock ticks (assuming a 286 or 8086 could shift 32-bit values).
Source: https://zs3.me/intel_s#sar [zs3.me]
That was the "big deal" with the barrel shifter providing one-cycle shifts. The clock cycle count went down to one clock for microcode setup, one clock for the actual shift, and one clock for microcode to store away the result. This was compared to a significantly larger number of clock ticks that would have been needed on a 286 or 8086 CPU.
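The clock counts quoted above for a register shift (3 on the 386, 5+n on the 286, 8+4*n on the 8086) are easy to tabulate. A small Python sketch, where the function name and the dictionary layout are just illustrative choices:

```python
def shift_clocks(cpu, n):
    """Clocks for a register shift by n bits, using the formulas quoted in the thread."""
    formulas = {
        "8086": lambda bits: 8 + 4 * bits,  # cost grows by 4 clocks per bit
        "286": lambda bits: 5 + bits,       # cost grows by 1 clock per bit
        "386": lambda bits: 3,              # constant, thanks to the barrel shifter
    }
    return formulas[cpu](n)

for cpu in ("8086", "286", "386"):
    print(cpu, shift_clocks(cpu, 30))
```

For the 30-bit example this gives 128, 35, and 3 clocks respectively.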
DrkShadow on Saturday February 14, @03:11AM
The article does cite ROL/ROR/SHL/SHR/SAR as 3 cycles (which I missed). The intro to the article says,
> Shifts and rotates are a different story: the 386 has a dedicated barrel shifter that completes an arbitrary multi-bit shift in a single cycle.
So then the addition/subtraction (shift-left -- 32-N) takes 1 cycle, the population of the registers takes 1 cycle, and the hardware barrel-shifter takes 1 cycle? That seems like what it ought to be, but I always thought bit-shift operations were the cheapest of the cheap CPU operations to do. I thought they were always 1-cycle. Maybe not..
Anonymous Coward on Saturday February 14, @09:39AM
Look at the shift/rotate timings of 70s microprocessors/minicomputers. Most were one clock per shift because barrel shifters took a lot of logic to implement.
owl on Saturday February 14, @07:08PM
And some (e.g. the 6502) were simply "one bit shift per instruction". One could only do single-bit shifts. If one wanted to do a multibit shift, one had to execute a loop like so (this is from long-old memory, so it may not be perfect):
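        ldx #4      ; X = number of bits to shift
loop:
        lsr         ; shift the accumulator right by one bit
        dex         ; one fewer bit left to do
        bne loop    ; repeat until X reaches zero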
Three instructions in the inner loop. The lsr was 2 clocks, dex was 2 clocks, and bne was 2 clocks when it fell through or 3 when taken (plus 1 more if the branch crossed a page boundary). So 7 clocks per single-bit shift while the loop is running and 6 on the final pass, or 27 clocks for the four-bit shift example above. If we also include the ldx immediate (2 clocks) as necessary setup, that's 29 clocks for a four-bit shift. Being able, on the '386, to perform 1- to 31-bit shifts in a constant 3 clocks was a huge deal back in 1985, when the '386 was first available for sale.
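For anyone who wants to play with the numbers, here is a small Python model of that loop's cost. It assumes the usual documented 6502 timings (lsr on the accumulator and dex at 2 cycles each, bne at 3 cycles when taken and 2 when it falls through, ldx immediate at 2 cycles) and that the branch never crosses a page boundary; the function name is made up for this sketch.

```python
def loop_shift_cycles(bits, include_setup=True):
    """Cycles for the ldx/lsr/dex/bne loop shifting the accumulator by `bits` (1..255)."""
    LSR_A, DEX, LDX_IMM = 2, 2, 2
    BNE_TAKEN, BNE_NOT_TAKEN = 3, 2  # assumes no page crossing

    if not 1 <= bits <= 255:
        raise ValueError("bits must be in 1..255 (ldx #0 would loop 256 times)")

    # Every pass except the last takes the branch back to the top.
    cycles = (bits - 1) * (LSR_A + DEX + BNE_TAKEN)
    cycles += LSR_A + DEX + BNE_NOT_TAKEN
    if include_setup:
        cycles += LDX_IMM
    return cycles

print(loop_shift_cycles(4))
```

Under these assumptions the cost grows by 7 cycles per extra bit, which is the contrast with a constant-time barrel shifter.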