My Ideal Processor, Part 9
(This is the 60th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
By using some downtime productively, I devised a virtual processor architecture which is a fairly inoffensive blend of 3-address RISC processor architectures. It is most similar to an early ARM or early Thumb architecture. However, it uses BER to compact instructions, much like UTF-8 encoding. Although it only has eight general purpose registers, they hold integers and floating point values, and all operations on small data types are duplicated across the width of the register, much like the SIMD defined in FirePath. I hoped to implement something like the upward compatibility of the Transputer architecture. Specifically, 16 bit software is upwardly compatible with 32 bit registers or possibly more. My hope has been greatly exceeded because it is feasible to implement 8 bit registers in an 8 bit address space, and individual bit addressing is possible but less practical. At the other end of the scale, 64 bit is practical and 1024 bit or more is possible.
It has irked me that processor architectures are advertised as having 2^r general purpose registers and d addressing modes, only for me to discover that one (or typically two) of the "general purpose" registers is a program counter or micro-coded call stack - and that many of the addressing modes rely on these registers being mapped within the register set. For example, branch is implemented as a register move to the program counter and a relative branch is an addition to the program counter. Stack manipulation similarly relies on re-use of generic addressing modes. Furthermore, one stack with a mix of data and return addresses is encouraged because it uses the fewest registers. Anything which deviates from this group-think requires additional hardware, additional instructions and incurs a slight impedance mismatch with historical code. However, the use of generic addressing modes relies upon a conflation of data register size and address pointer size. With heavy processing applications, such as multi-media, we have GPUs with a bus width of 1024 bits and an Xtensa processor architecture with register width up to 1024 bits.
Beyond 16 bits, binary addition becomes awkward. 32 bit and 64 bit flat memory models have been accommodated by smaller circuits and the resulting increase in execution speed. However, rather than targeting a 32 bit, "bare metal" RISC architecture (or RISC pretending to be CISC), in the general case it would have been preferable to target a 64 bit virtual machine and allow more varied implementation. (Yes, I sound like a curmudgeon who is ready to work in a mainframe environment.)
Virtual machines vary widely. For example, Perl and Python have bytecode which includes execution of regular expressions. Java has its infamous "invoke virtual method", although that hasn't discouraged Sun's Java hardware or ARM's Jazelle. What I'm suggesting is keeping it real: 100% execution on a hypothetical processor, and only invoking unknown instruction exception handling on lesser implementations, such as hardware without an FPU. I expect this will "gracefully degrade" like Sun's and SGI's handling of SPARC and MIPS operating system updates. In practice this becomes unusable because heavy use of the latest instructions increases the proportion of time spent emulating instructions, and this occurs on legacy hardware which is already under-powered. I presume that Apple has been doing something similar in addition to restricting battery usage. Unless a conscious effort is made to drop out of a cycle of consumption, for example, by fixing hardware and operating system during development, most users of closed-source and open-source applications and operating systems are bludgeoned into accepting updates to maintain a pretense of security. Users have a "choice" along a continuum between fast and secure. However, the end points are rather poor and this encourages cycles of hardware consumption and license upgrades. This is "Good, fast, cheap. Choose any two." combined with planned obsolescence.
Fortunately, we can use the consequences of miniaturization to our advantage. Specifically, 2010s micro-controllers exceed the capabilities of 1960s mainframes. They have more memory and wider data paths. They invariably have better instruction sets and more comprehensive handling of floating point numbers. They have about 1/100000 of the execution latency, about 1/100000 of the cost and about 1/100000 of the size and weight. Unfortunately, security and reliability have gone backwards. A car analogy is a vehicle which costs US$1, has the mass of a sheet of paper, travels one million miles on one tank of hydrocarbons and causes one fatal crash every three miles - if you can inexplicably get it to start or stay running.
This is the problem. Security and reliability are unacceptably low. Consumers don't know that it is possible to have a computer with one year of uptime or significantly more. Proper, dull practice is to fix bugs before adding features. However, market economics rewards the opposite behavior. Microsoft was infused with neighboring Boeing's approach to shaking out bugs. I'm not sure that bugs introduced by programmers should be handled like defects in metallurgy; in software, the major causes of failure are bit flips and your own staff. Regardless, Microsoft, a firm proponent of C++, adopted a process of unit testing, integration testing, smoke testing, shakedown testing and other methodologies which were originally used to assemble airplanes. However, while Boeing used this to reduce fatalities and the associated poor branding, there is a perception that Microsoft deliberately stops short of producing "low defect" software. That's because an endless cycle of consumption is severely disrupted by random dips in reliability. Users (and our corporate overlords) accept a system with an uptime of two hours if it otherwise increases productivity. For scientific users, a 1960s punch card job or a 1970s interactive session on a mini-computer saved one month of slide rule calculations and often provided answers to six decimal figures. Contemporary users want an animated poo emoji so that they can conspicuously signal their consumption of blood minerals.
I've spent more than eight months trying to compile open source software in a manner which is secure and reliable. The focus was process control on ARM (minus ECC) but the results are more concerning. I didn't progress to any user interface software beyond a shell or text editor - specifically, bash and vim. I was unable to bootstrap anything and I was only able to compile one component if numerous other components were already present. An outline of the mutual dependency graph is that clang and llvm require another compiler to provide functionality such as a mathematics library. Invariably, this comes from gcc. Historically, gcc had a mutual dependency with GNU make. Both now have extensive dependencies upon other GNU packages. One package has a hard dependency on Perl for the purpose of converting documentation to a human readable format without creating a mutual loop within GNU packages. So core GNU packages are directly or indirectly dependent upon Artistic License software, which only throws the mutual loop much wider. I was able to patch out the hard dependency for one version of the software but I was using Perl and GNU make to do it. Even if I had taken another path, I would, at best, be dependent upon a GNU or BSD license kernel compiled with a GNU or BSD license compiler. Market economics mean that GNU and BSD kernel distributors include thousands of other packages. Via Perl repositories, Python repositories, Ruby repositories, JavaScript repositories, codec repositories, dictionaries and productivity scripts, there may be indirect support for 35000 packages or more. Any component within a package may require more than 100 other components, and the dependencies are often surprising because few are aware of the dependency graph. We also have the situation where some packages are gutted and re-written on a regular basis and some open source projects incur management coups. This rarely concerns distributors. That would be acceptable if everyone was competent and trustworthy. This is far from the case and we seem to have lost our way. In particular:-
Coupling between software components is too high and it is increasing.
If you want to run mainstream software, we have completely lost the ability to bootstrap from the toggle switches.
Our software has no provenance.
We cannot trust our compilers.
We cannot trust any other software to work as described.
We cannot trust hardware.
A movement of citizens can capture a project.
A false movement can capture a project.
Covert infiltration is known to be widespread and occurring at every level.
This made me despondent. Was there any other outcome? When the new rich use novelty to signal wealth, was there any other outcome? When governments and corporations routinely abuse individuals, was there any other outcome? With industrial espionage and dragnet snooping, was there any other outcome? I considered contributing to BSD. I like the empirical approach of fixing bugs found on your own choice of hardware, combined with the accumulated knowledge of open source software. However, in practice, it involves decades of Unix cruft layered over decades of x86 cruft. I also considered buying a typewriter to get some privacy. However, I'd probably use it with someone else's network printer/scanner/photocopier and/or many of the recipients would use larger scanning systems. Taking into account the acoustic side-channel of a typewriter, I'd probably have less security than using pen and paper while inconveniencing myself.
However, the answer to security and reliability is in the statement of the problem. We have to bootstrap from the toggle switches. We have to regain provenance of our compilers. Forget about network effects, economic pressure or social pressure. The official report about the Internet Worm of 1988 recommended that no implementation should be used in more than 20% of cases. That means we need a minimum of five processor architectures, five compilers, five operating systems without shared code, five databases, five web servers, five web browsers and five productivity packages. That's the minimum. And what have we got? Two major implementations of x86, one major source for ARM and almost everything else is niche and/or academic research. Compilers are even worse. Most embedded compilers use a fork of gcc which lacks the automatic security features of the main branch. clang and the proprietary compilers aren't sufficient to establish compiler trust. Ignoring valid concerns about hardware and kernels, anyone outside of a proprietary compiler vendor has difficulty getting three sets of mutually compatible compiler source code in one place for the purpose of cross-compiling all of them and checking for trustworthy output. If that isn't easy to replicate then everything else is fringe.
I envision SPIT: the Secure, Private Internet of Things. Or perhaps SPRINT: the Secure, Private, Reliable InterNet of Things. (This is why you shouldn't let geeks name things.) I start with a dumb bytecode interpreter. This consists of case statements in a loop which implement the fetch-execute cycle of a processor. Or a shell. Or an interpreted language's interactive prompt. Although the principle is general, the intention is to make something which can be implemented as hardware without supporting software. Initially, it runs on an untrusted host operating system and an untrusted compiler. That is sufficient for design, benchmarking and instrumentation, such as counting branches and memory accesses. An assembler allows a developer to get an intuitive understanding of the processor architecture. This allows idioms to be developed, which is useful for efficient compiler output. A two-pass assembler also provides forward reference resolution for a trivial one-pass compiler. (By chance, I've been given a good book on this topic: Compilers: Principles, Techniques, and Tools by Princeton doctorates Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman.)
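As a rough illustration of that starting point, here is a minimal sketch in C of case statements in a loop implementing a fetch-execute cycle. The opcodes, operand formats and register count are hypothetical placeholders, not the encoding described later in this article.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical opcodes for illustration only. */
enum { OP_HALT, OP_LOAD_IMM, OP_ADD, OP_PRINT };

/* A dumb fetch-execute loop: read a byte, branch on it, repeat. */
static void run(const uint8_t *code)
{
    uint32_t reg[8] = {0};
    size_t pc = 0;

    for (;;) {
        uint8_t op = code[pc++];              /* fetch */
        switch (op) {                         /* execute */
        case OP_LOAD_IMM: {                   /* reg[d] = imm8 */
            uint8_t d = code[pc++];
            reg[d] = code[pc++];
            break;
        }
        case OP_ADD: {                        /* reg[d] = reg[s] + reg[t] */
            uint8_t d = code[pc++];
            uint8_t s = code[pc++];
            uint8_t t = code[pc++];
            reg[d] = reg[s] + reg[t];
            break;
        }
        case OP_PRINT: {                      /* print a register */
            uint8_t d = code[pc++];
            printf("r%u = %u\n", (unsigned)d, (unsigned)reg[d]);
            break;
        }
        case OP_HALT:
            return;
        default:
            /* On a lesser implementation, this is where an unknown
               instruction exception handler would be invoked. */
            fprintf(stderr, "unknown opcode %u\n", (unsigned)op);
            return;
        }
    }
}

int main(void)
{
    const uint8_t prog[] = {
        OP_LOAD_IMM, 0, 2,     /* r0 = 2       */
        OP_LOAD_IMM, 1, 3,     /* r1 = 3       */
        OP_ADD, 2, 0, 1,       /* r2 = r0 + r1 */
        OP_PRINT, 2,           /* prints r2 = 5 */
        OP_HALT
    };
    run(prog);
    return 0;
}
```

Counters for branches, memory accesses and instruction lengths can be incremented inside the relevant cases, which is all the instrumentation needed at this stage.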
From this point, it is possible to build circuits and/or deploy virtual machines which perform practical tasks. It is also possible to implement functionality which is not universal, such as parity checking on registers and memory, backchecking all computation, and/or checkpoints and high availability. Checkpoints can be implemented independently of the assembler or compiler, but it may be beneficial for them to provide explicit checkpoints.
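To make the register parity idea concrete, here is a minimal sketch, assuming a software implementation which shadows each register with an even-parity bit. The structure and function names are illustrative, not part of the architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Fold the bits of a 32 bit value down to a single even-parity bit. */
static uint8_t parity32(uint32_t v)
{
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return (uint8_t)(v & 1u);
}

/* Shadow each register with its parity, updated on every write. */
struct checked_reg {
    uint32_t value;
    uint8_t  parity;
};

static void reg_write(struct checked_reg *r, uint32_t v)
{
    r->value  = v;
    r->parity = parity32(v);
}

static uint32_t reg_read(const struct checked_reg *r)
{
    /* A bit flip in either field shows up as a mismatch here;
       a hardware implementation would raise a machine check instead. */
    assert(parity32(r->value) == r->parity);
    return r->value;
}
```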
There are a large number of choices which are not constrained by this process:-
Is there conflation between interrupts and exceptions?
After deciding all of this and more, few constraints have been placed upon the ALU, FPU or instruction format. Specifically, almost every ALU and FPU function can be specified as an instruction prefix. For example, I devised a 3-address processor architecture where the default instruction was a 2-address move between eight registers, encoded within 1 byte as 00-sss-ddd. Escape sequences convert this instruction into addition, subtraction and bit operations, encoded as 01-000-aaa for common ALU functions and 01-001-bbb for rare ALU functions. Instructions may optionally use a third register and/or a wider range of registers, encoded as 10-ppp-rrr. Unfortunately, this exercise is far less efficient than 8080 derivatives, such as Z80 and x86. In particular, the heavy use of prefixes leads to a combinatorial explosion of duplicate encodings. Regardless, escapes have their place. For example, rather than providing conditional branch, PIC micro-controllers have instructions to optionally skip the next instruction. This allows conditional branches, conditional moves, conditional arithmetic and conditional subroutines. VAX mini-computers, in contrast, implement addressing modes as suffix codes. The difficulty with escapes is that they generally reduce execution speed. Although it is possible to implement instruction decode at the rate of 4 or 5 words per clock cycle, it is more typical for an implementation to be the inverse: 4 or more clock cycles per word. This is greatly complicated when variable length instructions may cross an instruction cache boundary or a virtual memory page boundary.
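A sketch of how such a prefixed, 1 byte format might be decoded in software is shown below. The field interpretation is my own reading of the description above and only the common ALU group is modelled; treat it as illustrative rather than definitive.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the prefixed 1 byte format described above:
     00 sss ddd   execute: r[ddd] = r[sss], or the ALU op set by a prefix
     01 000 aaa   prefix: select a common ALU function
     01 001 bbb   prefix: select a rare ALU function (not modelled here)
     10 ppp rrr   prefix: supply a third register (wider ranges not modelled) */
enum alu { ALU_MOV, ALU_ADD, ALU_SUB, ALU_AND, ALU_OR, ALU_XOR };

static void run(const uint8_t *code, size_t len, uint32_t r[8])
{
    enum alu op = ALU_MOV;   /* pending prefix state */
    int third = -1;

    for (size_t pc = 0; pc < len; pc++) {
        uint8_t b = code[pc];
        switch (b >> 6) {
        case 1:                                  /* 01-xxx-xxx: ALU prefix */
            if ((b & 0x38) == 0x00)              /* 01-000-aaa: common ops */
                op = (enum alu)(b & 7);
            continue;
        case 2:                                  /* 10-ppp-rrr: third register */
            third = b & 7;
            continue;
        case 0: {                                /* 00-sss-ddd: execute */
            uint8_t s = (b >> 3) & 7, d = b & 7;
            uint32_t a = r[s];
            uint32_t c = (third >= 0) ? r[third] : r[d];
            switch (op) {
            case ALU_MOV: r[d] = a;     break;
            case ALU_ADD: r[d] = c + a; break;
            case ALU_SUB: r[d] = c - a; break;
            case ALU_AND: r[d] = c & a; break;
            case ALU_OR:  r[d] = c | a; break;
            case ALU_XOR: r[d] = c ^ a; break;
            }
            op = ALU_MOV;                        /* prefixes apply to one instruction */
            third = -1;
            break;
        }
        default:                                 /* 11-xxx-xxx: unassigned here */
            break;
        }
    }
}
```

The pending prefix state held in op and third is exactly the kind of hidden state which complicates interrupt handling, as discussed further below.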
RISC solves these problems and others but the code density is poor. 32 bit instructions are impractical and 16 bit encodings (mine included) are poor unless longer instructions are allowed. Xtensa takes the unusual approach that the instruction opcode implicitly determines whether an instruction is 16 bits or 24 bits. I've found that an explicit continuation bit per byte allows code density to match or exceed that of 32 bit ARM instructions and 16 bit Thumb instructions.
Starting with 32 bit ARM instructions, the 4 bit condition field is rarely used and the remaining 28 bits can be packed into 1, 2, 3 or 4 bytes using a BER-style format where the top bit of each byte indicates a subsequent byte. This is similar to UTF-8 but without the need to be self-synchronizing. Some of the saving can be spent on the lost conditional functionality. Yes, this may require additional clock cycles, but the overall saving on a system with an instruction cache would be worthwhile. 16 bit Thumb instructions make the saving less pronounced, but having additional lengths (and no additional escape bits in total) means that a trivial encoding in BER format is more efficient than Thumb encoding. If fields are arranged so that the least frequent encodings occur at one end then the efficiency gain is amplified. BER format optionally allows longer instructions and therefore it becomes relatively trivial to reach or exceed the peak 40 bits per clock cycle of instruction decode of an Intel Xeon.
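A minimal sketch of the byte packing follows. It assumes 7 payload bits per byte in least-significant-group-first order (as in LEB128; ASN.1 BER places the most significant group first, but the continuation-bit principle is the same) and caps the length at 4 bytes for a 28 bit payload.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack up to 28 bits, 7 bits per byte, with the top bit set on every
   byte except the last.  Returns the number of bytes written (1..4). */
static size_t ber_encode(uint32_t value, uint8_t out[4])
{
    size_t n = 0;
    do {
        uint8_t b = value & 0x7f;
        value >>= 7;
        if (value != 0)
            b |= 0x80;                  /* another byte follows */
        out[n++] = b;
    } while (value != 0 && n < 4);
    return n;
}

/* Inverse: read bytes until one with a clear top bit is found.
   Stores the decoded value and returns the number of bytes consumed. */
static size_t ber_decode(const uint8_t *in, uint32_t *value)
{
    uint32_t v = 0;
    unsigned shift = 0;
    size_t n = 0;
    uint8_t b;
    do {
        b = in[n++];
        v |= (uint32_t)(b & 0x7f) << shift;
        shift += 7;
    } while ((b & 0x80) && n < 4);
    *value = v;
    return n;
}
```

With this packing, an instruction whose significant fields fit in the low 7 bits occupies one byte, and only the rarest encodings pay for all four.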
So, any conventional 3-address RISC instruction set with suitably arranged fields may be a practical implementation with practical code density and practical execution speed. The trick is to not screw it up. Don't do anything too radical. Indeed, use of BER instruction packing allows other changes to be decoupled with confidence. Knowing that the optimal instruction format has not been chosen, it remains possible to implement a virtual processor, implement an assembler, implement a trivial compiler, implement an optimized compiler and then improve the instruction format without creating or modifying any compiler. This can be achieved by making cursory changes to the virtual machine to obtain statistics. Then changes can be made to the assembler and virtual machine to accommodate the revised instruction encode and decode.
Some inefficiency may be beneficial for some implementations. Split fields are of minimal consequence for a hardware implementation or FPGA but may be very awkward for a general purpose computer. Likewise, some opcode arrangements may be beneficial for a GPU but awkward for an FPGA. In particular, omission of a flag register is very beneficial for a software implementation but detrimental to the composability of instructions and therefore detrimental to code density. However, the use of conditional prefix instructions is also detrimental. They require implicit state to be held for one instruction cycle, which adversely affects interrupt response. Hidden state complicates a design. In this case, the hidden flags cannot be dumped anywhere so the instruction sequence must be executed atomically.
There is the less immediate concern of running a 32 bit and/or 64 bit virtual processor on more modest hardware. This is a classic mainframe technique. Motorola, Tandem and others had less success replicating this on cheaper hardware, but there are minimal complaints about Atmel AVR support for 32 bit and 64 bit numbers on processors with 8 bit and 16 bit registers. I'm one of the worst offenders but it isn't that bad and I'm a relatively rare corner case. I've made extensive effort to optimize code size and execution speed across common architectures. Sometimes these objectives align and sometimes they don't. In addition to that, I've made extensive effort to avoid Arduino's patronising interface and instead use GNU make for compilation across Atmel SAM ARM and Atmel AVR. This is mostly a publicly available script plus compiler flags, configuration settings taken from previous work and archiving taken from previous work. Unfortunately, it drops the AVR library routine for arbitrary modulo. I presume division is similarly affected. Most people don't notice how this is implemented: powers of two are reduced to bit shifts and bit masks, but it is an irritation to not have, for example, modulo 10, and that's why I haven't published it.
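For readers who haven't seen the reduction, a power-of-two divisor lets the compiler replace division and modulo with a shift and a mask, which is why only arbitrary divisors such as 10 need a library routine. A small illustration for unsigned operands:

```c
#include <stdint.h>

/* For unsigned x and a power-of-two divisor 2^k the compiler can emit: */
static inline uint16_t div_pow2(uint16_t x, unsigned k)
{
    return x >> k;                             /* x / (1 << k) */
}

static inline uint16_t mod_pow2(uint16_t x, unsigned k)
{
    return x & (uint16_t)((1u << k) - 1u);     /* x % (1 << k) */
}

/* An arbitrary divisor such as 10 has no such shortcut, so x % 10 falls
   back to a library division/modulo routine on cores without hardware divide. */
```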
For many applications, such as process control, a virtual processor would be indistinguishable from current practice while offering advantages. A 40MHz, 16 bit micro-controller is considered to be slow and under-powered. Regardless, many applications, such as hydroponic pump control, only require response within 15 seconds. I presume that systems such as chemical mixing and gas pipelines have similar bounds. Faster is better but reliable is best. If it is possible to increase reliability by a factor of 10 but it reduces speed by a factor of 1000 then there are circumstances where it is very worthwhile to trade a surplus resource.