I had some time to kill in a hospital so I took some paper and solved a problem which has been annoying me for more than three years. I wanted to implement an instruction set with good code density while maintaining an efficient hardware implementation. This requires presenting one or more useful instruction fields at approximately the frequency with which they are required. It also requires using the majority of bit permutations within each field. Presenting useful fields is trivial. Popular processor architectures like IBM 360, DEC VAX and Intel x86 make instruction length implicit as part of an instruction. RISC processor architectures insist on one size of instruction (usually with exceptions for constants). ARM Thumb re-uses a (rarely used) conditional field as a method of compacting many of its 32 bit instructions to 16 bit representations. However, the Xtensa processor architecture shows that good code density can be achieved by straddling a golden ratio with 2 byte instructions and 3 byte instructions where the length is implicit to the instruction.
I've found that similar results can be achieved by making instruction size explicit rather than implicit. Specifically, through the use of BER (Basic Encoding Rules). With this encoding, bytes are split into a one bit field and a seven bit field. The top bit indicates whether another byte follows. This allows numeric values of any size to be encoded in multiples of seven bits. In particular, it allows processor instructions to be encoded as 7 bits, 14 bits, 21 bits, 28 bits or more. It is also possible to choose representations other than octets, or ones where escape bits are skipped. So, for example, 2 byte, 3 byte and 5 byte encodings would only require two or three escape bits among 37 bits or significantly more. This allows detailed instructions to be defined while programs match or exceed the code density of AVR or ARM. Explicit escapes appear to reduce code density but they provide more options for compaction while simplifying instruction decode.
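As a sketch, the escape scheme can be modelled with the big-endian variable-length convention (one continuation bit per byte, most significant group first). The function names and the exact byte order here are my assumptions, not part of the design:

```python
def ber_encode(value):
    """Encode a non-negative integer in 7 bit groups, most significant
    group first; the top bit of each byte flags that another byte follows."""
    assert value >= 0
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()
    return bytes(0x80 | g for g in groups[:-1]) + bytes([groups[-1]])

def ber_decode(data):
    """Decode: accumulate 7 bits per byte, stopping at the first byte
    whose top bit is clear."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            break
    return value
```

This is how values up to 127 occupy one byte, values up to 16383 occupy two bytes, and so on, which is the property the 7/14/21/28 bit instruction lengths rely upon.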
Within an unpacked instruction, fields can be arranged (or split) to minimize average instruction length. Where fancy functionality is rarely used, such as ARM's pre-scaler, the default bit permutation of zero requires no space when packed as BER. Another trick, most useful within a 3-address machine, is that one reference to a register may be relative to another. This allows compact representation for common cases, such as 2-address instructions. An example would be g+=h, which can also be written as g=g+h or g=h+g. An instruction which references g more than once may use zero in subsequent references. Obviously, this requires all references to be skewed but this can be achieved with XOR or addition such that common cases have zero in desired fields. This allows an instruction to be compacted more efficiently.
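A minimal sketch of the skewing trick, assuming XOR skew relative to the destination and hypothetical 3-bit register numbers:

```python
def skew(dst, src1, src2):
    """Skew source register fields relative to the destination with XOR,
    so re-use of the destination register encodes as a zero field."""
    return dst, dst ^ src1, dst ^ src2

def unskew(dst, f1, f2):
    """Decode the register references: XOR is its own inverse."""
    return dst, dst ^ f1, dst ^ f2
```

For g+=h with, say, g in register 6 and h in register 3, the encoded fields are (6, 0, 5); the zero field is exactly what packs away under BER.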
A worked example would be 5 bits for opcode, 2 bits for addressing mode and 3 bits for each of three register references. Ordinarily, this would require 16 bits. I've found 16 bit encoding to be quite restrictive. The conventional approach is an implicit sequence of additional 16 bit words. However, with BER and skewed register references, 2-address instructions only require 13 bits. This fits into 2 bytes while retaining fast decode of significantly longer instructions. In practice, instruction size is similar to the x86 VEX prefix but implemented cleanly rather than being bolt-on cruft. This arrangement also satisfies the criteria of an instruction-space as a binary fraction where there is a 1 byte breakpoint instruction and numerous useful instructions with 1 byte and 2 byte encodings. Although there is no explicit bias within a set of general registers, if operand references cross encoded byte boundaries then use of some registers leads to a more compact representation. This runs faster - even within a virtual machine. This provides opportunities to tweak code efficiency within a conceptually simple model but it isn't essential.
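The arithmetic of the worked example can be checked directly. The field layout is taken from the text; the assumption (mine) is that the likely-zero skewed field sits at the most significant end, so its leading zeros vanish under the 7-bits-per-byte escape scheme:

```python
# Field widths from the worked example.
OPCODE, MODE, REG = 5, 2, 3

def ber_bytes_needed(payload_bits):
    """Encoded bytes carry 7 payload bits each, plus 1 escape bit."""
    return -(-payload_bits // 7)        # ceiling division

# Full 3-address instruction: opcode + mode + three register fields.
full_3addr = OPCODE + MODE + 3 * REG    # 16 bits
# 2-address case: the skewed third register field is zero and its
# leading zeros pack away entirely.
two_addr = OPCODE + MODE + 2 * REG      # 13 bits
```

Thirteen bits fit within the 14 payload bits of a 2 byte encoding, while the full 3-address form spills into a third byte.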
This was all known to me but the piece that had been annoying me for more than three years was instruction set representation of a fully functional processor. It is easy to take specific cases, like g=h-i, or common optimizations, such as a-=c, but it is more difficult to devise an ALU, stack and main memory interface with realistic code density. A designer must ensure that there is a practical set of instructions which are sufficiently over the break-even point. A designer must also ensure that bit patterns are allocated to only one task. When I worked on a rigidly fixed 16 bit instruction set in which opcode could appear within each of four fields, I had numerous false successes in which the same bit patterns represented two or more operations. A more conventional division of opcode and operands reduces this problem but leads to the opposite problem of underwhelming allocation. This may occur directly, where a common instruction requires a long encoding, or indirectly, where idioms require an inordinately long sequence of instructions. There are two examples here.
x86 is generally regarded as achieving the best code density among widely deployed systems. The method by which this is achieved is questionable market economics. It is also achieved by re-cycling obsolete bit patterns. For example, the INT3 breakpoint instruction. Or 0x0F, which was POP CS on the 8086 and is now the opcode escape for MMX and suchlike. However, the legacy 16 bit mode has a very curious wrinkle. 20 bit addressing (1MB) is achieved by pre-scaling 16 bit segment pointers by four bits. In 2018, an armchair critic can easily suggest that pre-scaling by eight bits would have given a 24 bit address-space (16MB). However, this was devised in an era when systems shipped with 1KB RAM or less. (The 1976 pre-release documentation for the 6502 describes implementation of systems with 256 bytes or less.) Wasting up to 15 bytes between each of four segments was considered positively profligate. However, the astounding part comes from the rare case when pointers between segments are tested for equality. In this case, Intel's macro assembler generated a 44 byte sequence. That's excessive.
We've mostly dumped that historical baggage but the problems of addition and multiplication remain. This is the most pertinent example. When implementing multiplication, typically upon signed, two's complement binary integers, there are several choices:-
Signed multiplication only.
Unsigned multiplication only.
Two variants. Both inputs signed or both unsigned.
Four variants. Each input independently signed or unsigned.
Three variants. This follows from the observation that multiplication is commutative. However, the addressing modes available for operands may be:-
Restricted to registers.
Either operand may source data from memory.
Both operands may source data from memory.
There is also the consideration that multiplying m bits × n bits requires approximately m+n bits of answer to avoid overflow. This leads to choices:-
Store result in two specified registers.
Store result in two sequential registers.
Store result in two registers or one memory location.
Store result in two registers or two memory locations.
There are also options where one or more pieces of data are discarded. For example, by "writing" to a null register. This may be a dedicated constant register or it may be a register which is (otherwise) unremarkable. There is a similar set of choices for the quotient and remainder of integer division.
The initial release of the MC68000 provided MULS and MULU, where the source operand could come from a register or memory (the destination is always a data register). However, for very sound reasons, 32 bit arithmetic was initially implemented as two rounds of 16 bit operations and this led to the very obvious design decision to only implement 16 bit multiplication. Despite the unsigned capability and flexible addressing modes, the width limitation was deeply unpopular and led many people to falsely regard the entire range of MC680x0 processors as being 16 bit or only fully compatible with 16 bit implementations.
In RISC design, unsigned multiplication appears to be the obvious choice on the basis that, for example, 32 bit × 32 bit unsigned multiplication can be used to implement 31 bit × 31 bit signed multiplication. (Where do the sign bits go? They get held implicitly in the program counter when the signed multiplication macro executes one of four paths for positive and negative inputs.)
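A sketch of that four-path macro, with `umul` standing in for the hardware's unsigned multiply; the function names and structure are illustrative assumptions:

```python
def smul_via_umul(a, b):
    """Signed multiply synthesized from an unsigned multiply primitive.
    One of four execution paths is taken per sign combination, so the
    sign information is held implicitly 'in the program counter'."""
    def umul(x, y):            # stands in for the hardware unsigned multiply
        assert x >= 0 and y >= 0
        return x * y
    if a >= 0 and b >= 0:
        return umul(a, b)
    if a >= 0:                 # b < 0
        return -umul(a, -b)
    if b >= 0:                 # a < 0
        return -umul(-a, b)
    return umul(-a, -b)        # both negative
```

Note that negating an operand costs a bit: full-range 32 bit inputs cannot be negated safely, hence the 31 bit restriction.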
My personal preference is unsigned multiplication with restricted input range and no integer division because this provides the most implementation options and the most upward compatibility while the minimal implementation requires the fewest gates and the least energy consumption. However, this is within the context of a design which (provisionally) may source one aligned operand from memory and pre-scales forward branches to cache-line boundaries. (Nominally, 64 bytes.) So, synthesizing signed multiply requires execution paths which form a two tier binary tree. This requires four cache-lines. Each execution path requires two conditional branches (and possible execution pipeline stalls) or backward branches to reduce the number of cache-lines required. The most streamlined version requires a temporary register and 256 bytes of instruction-space. This provides less functionality than an MC68000 while incurring far more clock cycles than most RISC designs. That's unworkable.
We have to make a cut somewhere. We either have multiple memory access cycles, two inputs, two outputs and special 4-address multiplication instructions or some functionality must be discarded.
So, I'll try again. No integer division, signed multiplication with unrestricted input range and no handling of overflow. This allows all variants to be synthesized without branching but each unsigned input over the full range doubles the number of multiplications required. Temporary registers may also be required to assemble the pieces. Overhead can be reduced with moderate abuse of type casting. For example, it is possible to multiply and accumulate 20 bit unsigned × 8 bit unsigned in one instruction (and possibly one clock cycle). The design is otherwise fairly agnostic about signed and unsigned data. At worst, comparison of unsigned integers can be synthesized from signed comparison via use of two temporary registers and zero extra branches.
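Two of these syntheses can be sketched concretely. Splitting a full-range unsigned operand into halves keeps every piece non-negative (so the signed primitive never sees a sign bit), and flipping the sign bit turns unsigned ordering into signed ordering. Widths and helper names are my assumptions:

```python
def umul32_via_smul(a, b):
    """Full-range 32 bit unsigned multiply from a signed multiply
    primitive. Each full-range unsigned input is split into 16 bit
    halves, doubling the number of multiplications per such input:
    four partial products in total."""
    def smul(x, y):   # signed primitive; the halves never go negative
        return x * y
    al, ah = a & 0xFFFF, a >> 16
    bl, bh = b & 0xFFFF, b >> 16
    return (smul(al, bl)
            + (smul(al, bh) << 16)
            + (smul(ah, bl) << 16)
            + (smul(ah, bh) << 32)) & 0xFFFFFFFFFFFFFFFF

def ucmp_lt_via_signed(a, b, bits=32):
    """Unsigned a < b from a signed comparison: bias both operands
    by flipping the sign bit (two temporaries, zero extra branches)."""
    bias = 1 << (bits - 1)
    def to_signed(x):          # models the signed view of a register
        return x - (1 << bits) if x & bias else x
    return to_signed(a ^ bias) < to_signed(b ^ bias)
```

The temporaries holding the halves (or the biased values) are exactly the temporary registers the text mentions.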
However, after resolving this ALU functionality (and previously resolving FPU and SIMD functionality), stack and memory addressing modes are completely unresolved. My preference is bit addressable memory, or possibly nybble addressable, but anything smaller than bytes would be completely incompatible with the vast majority of RAM. Bit addressing also bloats constants and my designs are relatively bad in this regard. Bit addressable offsets don't affect pre-decrement or post-increment but they fail horribly when accessing an element within a structure or when skipping through an array of fixed length structures. From this, a friend made the observation that a stack only has to be aligned to the width of a general register and stack offsets can be implemented portably for any width of register. Specifically, 32 bit code can work with a 64 bit stack while encouraging good code density.
However, a trend popularized by VAX, MIPS, ARM and other processor architectures has been to steal two of the general purpose registers and replace them with a call stack and program counter. This allows MOV to PC to implement branch and post-increment to implement stack pop. (My preference for a zero register allows OR with zero to implement MOV while allowing indirect addressing modes to access a stack hidden under the zero register.) To maintain historical compatibility, performance and code density, some addressing modes don't make sense and some are detrimental. In particular, program counter relative emerges from a lack of position independent code and a failure to separate code and data. While it is possible to define meaningful semantics for the majority of bit permutations, this requires CISC micro-code or extra decode logic. This slows execution. Addressing modes can be offloaded to dedicated load and store instructions but this messes with code density more than ALU throughput. There is also the observation that x86 allows the maximum tricksiness of unaligned, little-endian access. This covers atomic, memory-to-memory conditional swap operations and, in most cases, doesn't cause the Turing complete memory interface to deadlock.
While it has never been advisable to store a 2 byte pointer on an odd memory boundary, use of RISC compatible, open source compilers has allowed aligned storage to become the norm. From Intel Core processors onwards, unaligned access is implemented with a penalty, without lawyers suing for deceptive practice and, in most cases, performs read and write to the requested memory locations. (See the Intel Core errata for reasons why NASA uses older Intel processors.)
Among schemes which work correctly, the trend has been to break historical instructions into micro-ops which make one memory access. Where an access crosses a word, cache or page boundary, it may require two micro-ops. For new designs, opcodes may source one aligned piece of data while performing computation. Dedicated instructions may also load or store aligned data in one instruction or unaligned data in two instructions. For efficiency, code density and historical compatibility, these instructions may implicitly set atomic locks and/or perform post-increment or suchlike. Intel's Itanium only offers post-increment in similar circumstances. This may be more useful or may be related to other details of the processor architecture. Regardless, it is trivial to make a register which is also a ripple counter. Likewise, it is trivial to make a decrementing counter by inverting the inputs and outputs of an incrementing counter. However, it is more cumbersome to make a latch which increments and decrements in a range of steps. Therefore, asymmetry, such as post-increment only, is occasionally seen.
I'd resolved much of this already. I just had to be certain that memory-to-memory addressing, such as the MC68000's MOVE (A0)+,(A1)+, is not a template to follow slavishly. Indeed, a suitable RISC design allows optional unaligned access which exceeds the historical functionality without slowing the common case.
In the spirit of Fred Brooks, Jr. ("Show me your data structures and I shall be enlightened. Show me flowcharts and I shall remain mystified."), for eight registers, three stacks (system, USR1, USR2) and a stack of program counters, instructions have:-
The first 5 bits define opcodes in groups of four such that:-
ALU or FPU instruction performs one of:-
You may notice several omissions from the instruction set. For example, as is often defined in C, logical shifts can be derived from arithmetic shifts. Also, OR with zero provides register-to-register MOV. Most importantly, where is addition? There is no 3-address addition which can source from memory. However, register-to-register multiplication is commutative and therefore some addition can be hidden in the multiply instruction. Specifically:-
a = b + ((imm + c) << pre)
a += b + ((imm + c) << pre)
It is useful to implement:-
a = 0
a = const
a += const
a -= const
a += a
a = b + c
a = b * b
a += b * b
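Most of the idioms above (the multiply forms aside) can be derived from the two instruction forms given earlier. This emulation is a sketch of that derivation, assuming a hardwired zero register; it models semantics only, not encoding:

```python
ZERO = 0   # a hardwired zero register is assumed

def form1(b, c, imm=0, pre=0):
    """a = b + ((imm + c) << pre) -- the 3-address form."""
    return b + ((imm + c) << pre)

def form2(a, b, c, imm=0, pre=0):
    """a += b + ((imm + c) << pre) -- the accumulating form."""
    return a + b + ((imm + c) << pre)

# a = 0      -> form1(ZERO, ZERO)
# a = const  -> form1(ZERO, ZERO, imm=const)
# a += const -> form2(a, ZERO, ZERO, imm=const)
# a -= const -> form2(a, ZERO, ZERO, imm=-const)
# a += a     -> form1(a, a)
# a = b + c  -> form1(b, c)
# a = b * b and a += b * b need the multiply instruction itself.
```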
Of these, handling constants is the most awkward. However, it may be possible to amortize this difficulty if constants are sourced from stack. Fixed quantity left shifts are provided with all ALU operations. With some byte manipulation, this also provides fixed quantity right shifts. See definition of ntohs() for example. (This idiom may be replaced with byte swap or word swap, if available.) Unused bit patterns may be used for AND, OR and XOR. Division and modulo which cannot be implemented as bit operations are implemented with a subroutine. This is standard practice on AVR and ARM. At present, there are no instructions to cast between integers and floating point. However, immediate values are cast appropriately and this may form the basis of conversion subroutines. Epsilon values over a large range can be obtained with use of pre-scaler and floating point division. "Branch never" means call subroutine. "Branch never nowhere" means return from subroutine.
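One reading of the ntohs() idiom: a fixed right shift can be synthesized from a fixed left shift into a double-width temporary followed by picking the high half. A sketch under that assumption, for 16 bit values:

```python
def shr16(x, n):
    """Fixed right shift of a 16 bit value using only a left shift and a
    'pick the high word' step, modelled here by the final >> 16."""
    assert 0 <= n <= 16 and 0 <= x < 1 << 16
    t = (x << (16 - n)) & 0xFFFFFFFF    # left shift into a 32 bit temporary
    return t >> 16                      # keep the high 16 bits (word pick)
```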
Although the instruction set is not completely orthogonal, omissions can be written as unconditional sequences of instructions which may use temporary variables. These local variables compete for register allocation or stack allocation like any other local variable. Ignoring the asymmetry of register-to-register addition with pre-scale, 2^r registers are completely orthogonal. Register allocation can be shuffled with impunity. However, program length may differ slightly when packed encodings change. When combined with function inlining and a callee-saves, stack-only call convention, register allocation has no dependencies between functions.
My Ideal Processor, Part 7
(This is the 58th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)