My Ideal Processor, Part 3
(This is the 49th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
ARM is a mess. Skip details if too complicated. When I see "ARM Cortex", I translate this as "ARM, with the interrupt controller which is incompatible with every other variant of ARM". Indeed, when we say ARM, is that ARM with the 26 bit address space and dedicated zero register, ARM with the 32 bit address space (with or without Java hardware, with or without one of the Thumb modes, with or without one of the SIMD variants) or ARM with 64 bit registers? And that's before we get to ARM licensee variants. So, for example, Atmel ARM processors are configured very much like micro-controllers. I/O pins may be set in groups of 32 to be digital inputs or digital outputs (with optional integral pull-up or pull-down resistors) or over-ridden to provide other functions. Broadcom's BCM2835, used in some models of Raspberry Pi, has banks of registers in which 30 bits configure I/O pins in groups of 10, and each pin can potentially be set to one of eight functions. I presume ARM processors from NXP have something completely different.
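As a rough illustration of the Broadcom-style layout, here is a minimal sketch of the bit arithmetic, assuming 10 pins per register and 3 bits (eight functions) per pin. The function names and the plain list of integers standing in for hardware registers are hypothetical; this is not a driver.

```python
PINS_PER_REGISTER = 10  # pins configured by one register
BITS_PER_PIN = 3        # 3 bits select one of eight functions

def set_pin_function(registers, pin, function):
    """Return a copy of `registers` with `pin` set to `function` (0..7)."""
    if not 0 <= function < (1 << BITS_PER_PIN):
        raise ValueError("function must be in 0..7")
    # Which register holds this pin, and which 3-bit slot inside it.
    index, slot = divmod(pin, PINS_PER_REGISTER)
    shift = slot * BITS_PER_PIN
    mask = ((1 << BITS_PER_PIN) - 1) << shift
    updated = list(registers)
    # Clear the old 3-bit field, then write the new function code.
    updated[index] = (updated[index] & ~mask) | (function << shift)
    return updated

def get_pin_function(registers, pin):
    """Read back the 3-bit function code for `pin`."""
    index, slot = divmod(pin, PINS_PER_REGISTER)
    return (registers[index] >> (slot * BITS_PER_PIN)) & 0b111
```

Under these assumptions, pin 17 lives in register 1 at bits 21 to 23, so changing one pin is a read-modify-write of a word shared with nine other pins.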
One of the most consistent aspects of ARM Cortex is that every manufacturer seems to offer 16 interrupt levels. No more. No less. In this regard, ARM Cortex is very similar to a TMS9900. This was a heavy-weight 3MHz mini-computer core which, entertainingly, was also the core of Speak 'N' Spell, as used by E.T. to phone home. This also has similarities to the RCA1802 which is most famous for being used in satellites.
These designs come from a very different age where open-reel tape (digital and analog) was the norm and bubble memory was just around the corner. Can you imagine a system with 2KB RAM and 4MB of solid state storage? That almost happened. Indeed, retro-future fiction based on this scenario would be fantastic.
The RCA1802 architecture is of particular interest. Its designers hadn't quite got the idioms for subroutines. Instead, there was a four bit field to determine which of the 16 × 16 bit general purpose registers would be the program counter. Calling a subroutine involved loading the subroutine address into a register and setting the field so that the register became the new program counter. Execution then continued from the specified address. Return was implemented by setting the field to the previous state. This has some merit because leaf subroutines typically use fewer registers. So, it isn't critical if general purpose registers become progressively filled with return addresses. Indeed, if you want to sort 12 × 16 bit values, an RCA1802 does the job better than a 68000 or an ARM processor in Thumb mode. However, in general, every subroutine has to know the correct bit pattern of the four bit field for successful return. This creates a strict hierarchy among subroutines and, in some cases, requires trampolines.
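The call and return convention is easier to see in a toy model. The sketch below assumes nothing beyond what is described above: 16 registers of 16 bits and a four bit field which picks the program counter. The class and method names are hypothetical, and each instruction is treated as one address unit for simplicity.

```python
class PcSelectMachine:
    """Toy model: 16 registers; the 4-bit field `p` selects the program counter."""

    def __init__(self):
        self.r = [0] * 16  # 16 general purpose 16-bit registers
        self.p = 0         # which register is currently the program counter

    @property
    def pc(self):
        return self.r[self.p]

    def step(self):
        """Advance the selected program counter by one (toy) instruction."""
        self.r[self.p] = (self.r[self.p] + 1) & 0xFFFF

    def select(self, n):
        """Make register n the program counter (SEP on the real 1802)."""
        self.p = n & 0xF


def call_and_return():
    m = PcSelectMachine()
    m.r[0] = 0x0100      # caller runs with register 0 as program counter
    m.r[3] = 0x0200      # load the subroutine address into register 3
    m.select(3)          # "call": register 3 becomes the program counter
    inside = m.pc
    m.step()             # subroutine executes one instruction
    m.select(0)          # "return": the callee must know the caller used 0
    return inside, m.pc, m.r[3]
```

Note that the final `select(0)` hard-codes the caller's register number, which is the strict hierarchy problem: a subroutine called from code running on a different register would return to the wrong place.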
The TMS9900 architecture does many things in multiples of 16. It has a great instruction set which uses the 16 bit instruction space efficiently. It has 16 interrupt levels. And it keeps context switching fast by having 16 × 16 bit registers in main memory, accessed via a workspace pointer. Indeed, there are aspects of this design which seem to have influenced early iterations of SPARC and ARM.
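A minimal sketch of the workspace pointer idea follows, assuming a word-addressed memory and hypothetical method names. Here only the old workspace pointer is saved, in register 13 of the new workspace; the real TMS9900 also saves the program counter and status in registers 14 and 15, and uses byte addresses.

```python
class WorkspaceMachine:
    """Toy model: the 16 registers are just 16 words of main memory at `wp`."""

    def __init__(self, memory_words=256):
        self.memory = [0] * memory_words
        self.wp = 0  # word index of register 0 in main memory

    def read_reg(self, n):
        return self.memory[self.wp + n]

    def write_reg(self, n, value):
        self.memory[self.wp + n] = value & 0xFFFF

    def switch_context(self, new_wp):
        """Switch register sets by moving the pointer; no registers are copied."""
        old_wp = self.wp
        self.wp = new_wp
        self.write_reg(13, old_wp)  # remember where to return to

    def return_context(self):
        """Return to the previous workspace recorded in register 13."""
        self.wp = self.read_reg(13)
```

The cost of a context switch is one pointer update plus a few saved words, regardless of how many registers the interrupted code was using.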
If we mix features of the RCA1802 and the TMS9900, we have a system which responds to interrupts within one instruction cycle. Specifically, if we take a TMS9900 interrupt priority encoder and use it to select an RCA1802 register as a program counter then nested interrupts occur seamlessly and without delay.
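The combination can be sketched as a priority encoder which picks one of 16 program counters on every instruction. This is a toy model under stated assumptions: level 0 is taken to be the highest priority, level 15 is treated as always-runnable background code, and all names are hypothetical.

```python
class InterruptMachine:
    """Toy model: one program counter per priority level, chosen by an encoder."""

    LEVELS = 16

    def __init__(self, reset_vectors):
        self.pc = list(reset_vectors)  # one program counter per level
        self.pending = 0               # bit n set means level n is requested

    def raise_irq(self, level):
        self.pending |= 1 << level

    def clear_irq(self, level):
        self.pending &= ~(1 << level)

    def active_level(self):
        """Priority encoder: the lowest-numbered pending level wins."""
        for level in range(self.LEVELS - 1):
            if (self.pending >> level) & 1:
                return level
        return self.LEVELS - 1  # nothing pending: run background code

    def step(self):
        """Execute one (toy) instruction on whichever level is active."""
        level = self.active_level()
        address = self.pc[level]
        self.pc[level] = address + 1
        return level, address
```

Because the interrupted level's program counter is simply left alone, nothing is saved on entry and nothing is restored on exit: a pre-empted handler resumes at the exact address where it stopped as soon as the higher priority request clears.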
To implement this on a contemporary processor, we would have 16 program counters. Where virtual memory is implemented, we probably want multiple memory maps and possibly a stack pointer within each memory map. However, if all program counters initialize to known addresses after reset, we do not require any read or write access to shadow program counters. Even if a processor is virtualizing itself, by the Popek and Goldberg virtualization requirements, it is possible to devise a scheme which never requires access to a shadow program counter.
In the worst case, we incur the initial overhead of pushing general registers onto the stack. (A sequence which may be nested as interrupt priority increases.) We also incur the overhead of restoring general registers. However, rather than loading alternate program counters with an address from a dedicated register or always indirecting from a known memory location, we unconditionally jump to the instruction immediately before the interrupt handler and execute a (privileged) yield priority instruction. On a physical processor, this leaves the alternate program counter with the most useful address: the start of the handler. On a virtual processor, the privilege violation leads to inspection of the instruction and that indicates that the address should be stored elsewhere.
On the first interrupt of a given priority, we require a trampoline to the interrupt handler. But on all subsequent calls, we run the interrupt handler immediately. It typically starts by pushing registers onto a local or global stack, but that is because the state which is specific to interrupts may be nothing more than a shadow program counter for each interrupt priority level and a global interrupt priority field. No instructions are required to read the internal state of the interrupt system. One special instruction is required to exit an interrupt and this is typical on many systems.
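The trampoline and yield sequence can be traced with a toy interpreter. Everything here is an illustrative assumption: the addresses, the three made-up operations ("jump", "work", "yield") and the function name. The only property being shown is that a yield placed one instruction before the handler leaves the shadow program counter pointing at the handler.

```python
RESET_VECTOR = 0x10  # hypothetical address where this level starts after reset
HANDLER = 0x41       # hypothetical first instruction of the interrupt handler

# The yield instruction sits one instruction before the handler.
PROGRAM = {
    RESET_VECTOR: ("jump", HANDLER),        # trampoline, used once only
    HANDLER - 1: ("yield", None),           # privileged yield priority
    HANDLER: ("work", "push registers"),
    HANDLER + 1: ("work", "service device"),
    HANDLER + 2: ("work", "pop registers"),
    HANDLER + 3: ("jump", HANDLER - 1),     # unconditional jump to the yield
}

def run_interrupt(program, shadow_pc, level):
    """Run one interrupt at `level` until it yields; return addresses executed."""
    trace = []
    while True:
        address = shadow_pc[level]
        op, arg = program[address]
        trace.append(address)
        if op == "jump":
            shadow_pc[level] = arg
        else:
            # "work" and "yield" both fall through to the next instruction.
            shadow_pc[level] = address + 1
            if op == "yield":
                return trace

shadow_pc = {5: RESET_VECTOR}  # priority level 5, chosen arbitrarily
first = run_interrupt(PROGRAM, shadow_pc, 5)
second = run_interrupt(PROGRAM, shadow_pc, 5)
```

In this sketch, the first trace begins at the reset vector and passes through the trampoline, while the second begins directly at the handler. At no point does any instruction read or write the shadow program counter explicitly.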