My Ideal Processor, Part 10

Posted by cafebabe on Tuesday January 16 2018, @08:10PM (#2930)
0 Comments
Hardware

(This is the 61st of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

I've been working on a virtual processor and I've got benchmark figures. The virtual processor is intended for process control but it may have general use or use on FPGA, GPU, optical computer or mainframe. In particular, there is a market for object code portability and high uptime across x86, ARM, AVR and other processor architectures.

I've been working on the instruction set and I've worked through a long list of decisions which are loosely connected. These aren't choices which can be made in isolation and without consequence elsewhere in a design, so each choice makes only tentative progress: a set of individually sensible choices may still combine into something impractical. For example, 64 registers × 64 bit requires a 4 kilobit context switch. Numerous registers are great. Wide registers are great. But fast interrupts are also great.

After going through this process repeatedly for more than three years, I've devised a fairly conventional processor architecture (3-address, Harvard architecture) which provides cross-compatibility for 8 bit, 16 bit, 32 bit, 64 bit or possibly larger registers and address pointers.

I wrote a fetch-execute instruction interpreter and then called it from a loop which tries all permutations of the 2^28 instructions. This crashed repeatedly. The optional floating point encountered divide by zero. After adding #include <signal.h> so that SIGFPE (floating point exception) could be set to SIG_IGN (ignore signal), it was trivial to also set SIGUSR1 and SIGUSR2 to trigger virtual processor interrupts. The interpreter optionally writes a PID file to make signal invocation easier. That's great fun. With one Unix command, it is possible to send virtual interrupts to a virtual processor.
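
In case the shape of that plumbing isn't obvious, here is a minimal sketch. The vm_interrupt() hook and the vm.pid file name are illustrative assumptions rather than the actual source:

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t pending_irq = 0;

    static void on_usr(int sig)
    {
        /* map the Unix signal to a virtual interrupt line */
        pending_irq = (sig == SIGUSR1) ? 1 : 2;
    }

    int main(void)
    {
        signal(SIGFPE, SIG_IGN);     /* survive the optional float divide by zero */
        signal(SIGUSR1, on_usr);     /* virtual interrupt 1 */
        signal(SIGUSR2, on_usr);     /* virtual interrupt 2 */

        /* optional PID file so that signal invocation is one command */
        FILE *f = fopen("vm.pid", "w");
        if (f) {
            fprintf(f, "%d\n", (int)getpid());
            fclose(f);
        }

        for (;;) {
            pause();                 /* wait for a signal */
            if (pending_irq) {
                /* vm_interrupt(pending_irq); -- enter the virtual processor */
                pending_irq = 0;
            }
        }
    }

The one Unix command is then something like: kill -USR1 $(cat vm.pid)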

Further testing found further problems in approximate order of difficulty. Apparently, I missed a bound check on a virtual stack. After these problems, the test ran smoothly. I expected the first full run of 256 million instructions to require about 30 minutes. It finished within five minutes. That's approximately one million instructions per second. After brief disassembly and seeing 140 byte stack references, I applied the standard optimization flags. This reduced the frequency of stack references and their severity to 40 bytes. I also obtained the usual doubling of performance. That's approximately two million virtual instructions per second for integer performance and 0.4 MIPS for a mixed integer and float workload. Optimistically, on a Raspberry Pi Model B Revision 2 with a 750MHz clock and 697 BogoMIPS (approximately 14 instructions per 15 clock cycles), it takes a minimum of 132 clock cycles to go around the fetch-execute loop, but many of the instructions exit early due to bogus opcodes, bound check failures or similar.

Despite being a 64 bit SIMD implementation which is vastly more powerful than a 16 bit computer from the 1980s, this arrangement captures less than 1% of the efficiency of the host architecture. I looked at the disassembly more closely. My code implements addition of 8 bit integers with a loop of eight fixed iterations. gcc correctly unrolls this loop and converts it to SIMD instructions. However, this doesn't occur for floating point instructions. This may be a limitation of the compiler or the target architecture. (ARMv6 with Neon.)

What am I trying to achieve? Optimistically, it is possible to achieve 1/3 efficiency. That would be one clock cycle to fetch a virtual instruction, one clock cycle for branch fan-out to each case statement and one clock cycle to implement a virtual instruction on native hardware. There is no need to return to start because duplicate fetch and decode can be placed at the end of each case. Therefore, bytecode interpreter execution flow may bounce around unstructured code as required.
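
As a sketch of that dispatch style, here is a toy version using the GCC computed goto extension (a plain switch with duplicated fetch works too); the three opcodes and the program are illustrative, not the real instruction set. Keeping the opcode in the low bits means no shift is needed before the indirect branch:

    #include <stdint.h>
    #include <stdio.h>

    int run(const uint8_t *code)
    {
        /* one label per opcode; bogus opcode 3 is treated as halt */
        static void *dispatch[4] = { &&op_halt, &&op_inc, &&op_dec, &&op_halt };
        const uint8_t *pc = code;
        int acc = 0;

    #define NEXT() goto *dispatch[*pc++ & 0x03]   /* fetch, decode, branch */

        NEXT();                        /* initial fetch */
    op_inc:  acc++; NEXT();            /* fetch and decode duplicated here ... */
    op_dec:  acc--; NEXT();            /* ... and here: no return to a loop head */
    op_halt: return acc;
    #undef NEXT
    }

    int main(void)
    {
        const uint8_t program[] = { 1, 1, 1, 2, 0 };  /* inc inc inc dec halt */
        printf("%d\n", run(program));                 /* prints 2 */
        return 0;
    }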

Nowadays, I presume that fetch and branch incurs an instruction pipeline stall of at least two clock cycles. Also, I'm attempting to implement 8 register × 64 bit on hardware which is 8 register × 32 bit. Furthermore, operations on arbitrary virtual registers (nominally R0-R7) require transfer to internal registers (nominally A, B and C) where cases may perform A=B+C or A+=B*C. This particular step would be easier with a stack virtual machine but difficulties occur elsewhere.

Current efficiency is a mystery. Maximum efficiency is a mystery. My implementation won't match the execution speed of Dalvik, which is deprecated in favor of ART [Android RunTime] and which is claimed by Oracle to have less than 1% of the efficiency of HotSpot in some circumstances. That's a familiar figure.

My interpreter and registers may exceed the size of the native processor cache. Like Dalvik, the opcode is in the least significant bits. This avoids bit shift prior to branch. That sounds clever and planned but it isn't possible to put an opcode elsewhere and get a broad range of 1 byte BER packed instructions. However, when testing instruction patterns sequentially, the cases of the instruction interpreter are called in rotation. This may incur significant cache churn. If this is the case, I might get significantly more throughput by testing a more realistic distribution of opcodes.

If this was intended to run games and apps, it would be fast enough to run Leisure Suit Larry but not fast enough to run Fruit Ninja.

My Ideal Processor, Part 9

Posted by cafebabe on Tuesday January 16 2018, @08:08PM (#2929)
0 Comments
Hardware

(This is the 60th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

By using some downtime productively, I devised a virtual processor architecture which is a fairly inoffensive blend of 3-address RISC processor architectures. It is most similar to an early ARM or early Thumb architecture. However, it uses BER to compact instructions. This is much like UTF-8 encoding. Although it only has eight general purpose registers, they hold integers and floating point values and all operations on small data types are duplicated across the width of the register. This is much like the SIMD defined in FirePath. I hoped to implement something like the upward compatibility of the Transputer architecture. Specifically, 16 bit software is upwardly compatible with 32 bit registers or possibly more. My hope has been greatly exceeded because it is feasible to implement 8 bit registers in an 8 bit address space and individual bit addressing is possible but less practical. At the other end of the scale, 64 bit is practical and 1024 bit or more is possible.

It has irked me that processor architectures are advertised as having 2^r general purpose registers and d addressing modes. I then discover that one (or typically two) of the "general purpose" registers is a program counter or micro-coded call stack - and that many of the addressing modes rely on these registers being mapped within the register set. For example, branch gets implemented as a register move to the program counter and a relative branch is addition to the program counter. Stack manipulation similarly relies on re-use of generic addressing modes. Furthermore, one stack with a mix of data and return addresses is encouraged because it uses the fewest registers. Anything which deviates from this group-think requires additional hardware, additional instructions and incurs a slight impedance mismatch with historical code. However, the use of generic addressing modes relies upon a conflation of data register size and address pointer size. With heavy processing applications, such as multi-media, we have GPUs with a bus width of 1024 bits and the Xtensa processor architecture with register width up to 1024 bits.

Beyond 16 bits, binary addition becomes awkward. 32 bit flat memory models and 64 bit flat memory models have been accommodated by smaller circuits and the resulting increase in execution speed. However, rather than targeting a 32 bit, "bare metal" RISC architecture (or RISC pretending to be CISC), in the general case, it would have been preferable to target a 64 bit virtual machine and allow more varied implementation. (Yes, I sound like a curmudgeon who is ready to work in a mainframe environment.)

Virtual machines vary widely. For example, Perl and Python have bytecode which includes execution of regular expressions. Java has an infamous "invoke virtual method" although that hasn't discouraged Sun's Java hardware or ARM's Jazelle. What I'm suggesting is keeping it real: 100% execution on a hypothetical processor, with unknown instruction exception handling invoked only on lesser implementations, such as hardware without an FPU. I expect this will "gracefully degrade" like Sun's and SGI's handling of SPARC and MIPS operating system updates. In practice this becomes unusable because heavy use of the latest instructions increases the proportion of time spent emulating instructions. This occurs on legacy hardware which is already under-powered. I presume that Apple has been doing similar in addition to restricting battery usage. Unless a conscious effort is made to drop out of a cycle of consumption, for example, by fixing hardware and operating system during development, most users of closed-source and open-source applications and operating systems are bludgeoned into accepting updates to maintain a pretense of security. Users have a "choice" along a continuum of fast and secure. However, the end points are rather poor and this encourages cycles of hardware consumption and license upgrades. This is "Good, fast, cheap. Choose any two." combined with planned obsolescence.

Fortunately, we can use the consequences of miniaturization to our advantage. Specifically, 2010s micro-controllers exceed the capabilities of 1960s mainframes. They have more memory and wider data paths. They invariably have better instruction sets and more comprehensive handling of floating point numbers. They have about 1/100000 of the execution latency, about 1/100000 of the cost and about 1/100000 of the size and weight. Unfortunately, security and reliability have gone backwards. A car analogy is a vehicle which costs US$1, has the mass of a sheet of paper, travels one million miles on one tank of hydrocarbons and causes one fatal crash every three miles - if you can inexplicably get it to start or stay running.

This is the problem. Security and reliability are unacceptably low. Consumers don't know that it is possible to have a computer with one year of uptime or significantly more. Proper, dull practice is to fix bugs before adding features. However, market economics rewards the opposite behavior. Microsoft was infused with neighboring Boeing's approach to shaking out bugs. I'm not sure that bugs introduced by programmers should be handled like defects in metallurgy. In software, major causes of failure are bit flips and your own staff. Regardless, Microsoft, a firm proponent of C++, adopted a process of unit testing, integration testing, smoke testing, shakedown testing and other methodologies which were originally used to assemble airplanes. However, while Boeing used this to reduce fatalities and the associated poor branding, there is a perception that Microsoft deliberately stops short of producing "low defect" software. That's because an endless cycle of consumption is severely disrupted by random dips in reliability. Users (and our corporate overlords) accept a system with an uptime of two hours if it otherwise increases productivity. For scientific users, a 1960s punch card job or a 1970s interactive session on a mini-computer saved one month of slide rule calculations and often provided answers to six decimal figures. Contemporary users want an animated poo emoji so that they can conspicuously signal their consumption of blood minerals.

I've spent more than eight months trying to compile open source software in a manner which is secure and reliable. The focus was process control on ARM minus ECC but the results are more concerning. I didn't progress to any user interface software beyond a shell or text editor. Specifically, bash and vim. I was unable to bootstrap anything and I was only able to compile one component if numerous other components were already present. An outline of the mutual dependency graph is that clang and llvm require another compiler to provide functionality such as a mathematics library. Invariably, this comes from gcc. Historically, gcc had a mutual dependency with GNU make. Both now have extensive dependencies upon other GNU packages. One package has a hard dependency on Perl for the purpose of converting documentation to a human readable format without creating a mutual loop within GNU packages. In addition to making core GNU packages directly or indirectly dependent upon Artistic License software, this only throws the mutual loop much wider. I was able to patch out the hard dependency for one version of the software but I was using Perl and GNU make to do it. Even if I had taken another path, I would, at best, be dependent upon a GNU or BSD license kernel compiled with a GNU or BSD license compiler. Market economics mean that GNU and BSD kernel distributors include thousands of other packages. Via Perl repositories, Python repositories, Ruby repositories, JavaScript repositories, codec repositories, dictionaries and productivity scripts, there may be indirect support for 35000 packages or more. Any component within a package may require more than 100 other components and the dependencies are often surprising because few are aware of the dependency graph. We also have the situation where some packages are gutted and re-written on a regular basis and some open source projects incur management coups. This rarely concerns distributors. That would be acceptable if everyone was competent and trustworthy. This is far from the case and we seem to have lost our way. In particular:-

  • Layered implementation is taught but not practiced.
  • Coupling between software components is too high and it is increasing.
  • If you want to run mainstream software, we have completely lost the ability to bootstrap from the toggle switches.
  • Our software has no provenance.
  • We cannot trust our compilers.
  • We cannot trust any other software to work as described.
  • We cannot trust hardware.
  • A movement of citizens can capture a project.
  • A false movement can capture a project.
  • Covert infiltration is known to be widespread and occurring at every level.

This made me despondent. Was there any other outcome? When the new rich use novelty to signal wealth, was there any other outcome? When governments and corporations routinely abuse individuals, was there any other outcome? With industrial espionage and dragnet snooping, was there any other outcome? I considered contributing to BSD. I like the empirical approach to fixing bugs found on your own choice of hardware combined with the accumulated knowledge of open source software. However, in practice, it involves decades of Unix cruft layered over decades of x86 cruft. I also considered buying a typewriter to get some privacy. However, I'd probably use it with someone else's network printer/scanner/photocopier and/or many of the recipients would use larger scanning systems. Taking into account the acoustic side-channel of a typewriter, I'd probably have less security than using pen and paper while inconveniencing myself.

However, the answer to security and reliability is in a statement of the problem. We have to bootstrap from the toggle switches. We have to regain provenance of our compilers. Forget about network effects, economic pressure or social pressure. The official report about the Internet Worm of 1988 recommended that no implementation should be used in more than 20% of cases. That means we need a minimum of five processor architectures, five compilers, five operating systems without shared code, five databases, five web servers, five web browsers and five productivity packages. That's the minimum. And what have we got? Two major implementations of x86, one major source for ARM and almost everything else is niche and/or academic research. Compilers are even worse. Most embedded compilers use a fork of gcc which lacks the automatic security features of the main branch. clang and the proprietary compilers aren't sufficient to establish compiler trust. Ignoring valid concerns about hardware and kernels, anyone outside of a proprietary compiler vendor has difficulty getting three sets of mutually compatible compiler source code in one place. This is for the purpose of cross-compiling all of them and checking for trustworthy output. If that isn't easy to replicate then everything else is fringe.

I envision SPIT: the Secure, Private Internet of Things. Or perhaps SPRINT: the Secure, Private, Reliable InterNet of Things. (This is why you shouldn't let geeks name things.) I start with a dumb bytecode interpreter. This consists of case statements in a loop. This implements the fetch-execute cycle of a processor. Or a shell. Or an interpreted language's interactive prompt. Although the principle is general, the intention is to make something which can be implemented as hardware without supporting software. Initially, it uses an untrusted host operating system and an untrusted compiler. That is sufficient for design, benchmarking and instrumentation, such as counting branches and memory accesses. An assembler allows a developer to get an intuitive understanding of the processor architecture. This allows idioms to be developed. This is useful for efficient compiler output. A two-pass assembler also provides forward reference resolving for a trivial one-pass compiler. (By chance, I've been given a good book on this topic: Compilers - Principles, Techniques And Tools by Princeton doctorates Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman.)

From this point, it is possible to build circuits and/or deploy virtual machines which perform practical tasks. It is also possible to implement functionality which is not universal, such as parity check on registers and memory, backchecking all computation and/or checkpoints and high availability. Checkpoints can be implemented independently of assembler or compiler but it may be beneficial for them to provide explicit checkpoints.

There are a large number of choices which are not constrained by this process:-

  1. Does the processor architecture work in binary? Probably yes but other choices include balanced ternary, decimal, binary coded decimal and the Chinese Remainder Theorem.
  2. Does it support negative numbers? Probably yes.
  3. Are negative numbers represented as sign and magnitude, one's complement, two's complement or other?
  4. Does it have opcodes and/or a program counter? Probably yes.
  5. Is it a pure Harvard architecture?
  6. What are the word sizes?
  7. Does it have stacks and/or registers?
  8. How many stacks?
  9. How many registers?
  10. Does it have registers for constants?
  11. Does it have direct access to memory?
  12. Does it have unrestricted access to memory?
  13. Does it have paging, segments, flat memory or other?
  14. Is there a constants segment?
  15. Does it allow unaligned memory access?
  16. Does it allow unaligned stack access?
  17. Does it have separate data registers and address registers and are they the same width?
  18. Does it support integer multiplication?
  19. Does it support integer division?
  20. Does it support integer modulo?
  21. Does it have support for floating point data?
  22. Does it support SIMD?
  23. Is there a conflation between SIMD, floating point, integers and/or pointers?
  24. Does it have one accumulator, multiple accumulators or registers which can be ganged?
  25. If it has a flag register then which flags are present?
  26. Which opcodes change which flags?
  27. If it doesn't have a flag register then what mechanism is provided for conditional execution?
  28. Is it a 2-address machine, 3-address machine, 4-address machine or other?
  29. Is it a pure load and store architecture?
  30. Does it take multiple operands from stack or memory?
  31. Does it allow memory-to-memory operations?
  32. Does it provide string operations?
  33. Does it provide arbitrary precision operations?
  34. Does it provide arbitrary size vector operations?
  35. Does it have support for multiple processors?
  36. Does it have support for co-processors?
  37. Does it have support for custom instructions?
  38. How does it handle interrupts?
  39. How does it handle exceptions?
  40. Is there conflation between interrupts and exceptions?

After deciding all of this and more, few constraints have been placed upon ALU, FPU or instruction format. Specifically, almost every ALU and FPU function can be specified as an instruction prefix. For example, I devised a 3-address processor architecture where the default instruction was a 2-address move between eight registers. This is encoded within 1 byte as 00-sss-ddd. Escape sequences convert this instruction into addition, subtraction and bit operations. This is encoded as 01-000-aaa for common ALU functions and 01-001-bbb for rare ALU functions. Instructions may optionally use a third register and/or a wider range of registers. This is encoded as 10-ppp-rrr. Unfortunately, this exercise is far less efficient than 8080 derivatives, such as Z80 and x86. In particular, the heavy use of prefixes leads to a combinatorial explosion of duplicate encodings. Regardless, escapes have their place. For example, rather than providing conditional branch, PIC micro-controllers have instructions to optionally ignore the next instruction. This allows conditional branches, conditional moves, conditional arithmetic and conditional subroutines. VAX mini-computers, by contrast, implement addressing modes as suffix codes. The difficulty with escapes is that they generally reduce execution speed. Although it is possible to implement instruction decode at the rate of 4 or 5 words per clock cycle, it is more typical for an implementation to be the inverse: 4 or more clock cycles per word. This is greatly complicated when variable length instructions may cross an instruction cache boundary or a virtual memory page boundary.
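
A sketch of decoding this scheme follows. The byte layouts are as described above but the ordering of prefixes and the printed interpretation are illustrative assumptions:

    #include <stdint.h>
    #include <stdio.h>

    /* decode one instruction starting at *pp and advance the pointer */
    static void decode_one(const uint8_t **pp)
    {
        const uint8_t *p = *pp;
        int alu_op = 0;                        /* 0 = plain 2-address move */

        /* consume 01-xxx-yyy and 10-ppp-rrr escapes before the final byte */
        while ((*p & 0xc0) == 0x40 || (*p & 0xc0) == 0x80) {
            if ((*p & 0xc0) == 0x40)
                alu_op = *p & 0x3f;            /* escape selects an ALU function */
            else
                printf("operand extension %02x\n", *p & 0x3f);
            p++;
        }
        /* final byte: 00-sss-ddd */
        printf("op=%d src=R%d dst=R%d\n", alu_op, (*p >> 3) & 7, *p & 7);
        *pp = p + 1;
    }

    int main(void)
    {
        const uint8_t insn[] = { 0x41, 0x13 };  /* 01-000-001 then 00-010-011 */
        const uint8_t *p = insn;
        decode_one(&p);                         /* prints "op=1 src=R2 dst=R3" */
        return 0;
    }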

RISC solves these problems and others but the code density is poor. 32 bit instructions are impractical. 16 bit encodings (mine included) are poor unless longer instructions are allowed. Xtensa takes the unusual approach that instruction opcode implicitly determines a 16 bit instruction or 24 bit instruction. I've found that an explicit bit per byte allows code density to match or exceed the code density of 32 bit ARM instructions and 16 bit Thumb instructions.

Starting with 32 bit ARM instructions, the 4 bit condition field is rarely used and the remainder is 28 bits which can be packed into 1, 2, 3 or 4 bytes using BER format where the top bit of each byte indicates a subsequent byte. This is similar to UTF-8 but without the need to be self-synchronizing. Some of the saving can be spent on replacing the lost conditional functionality. Yes, this may require additional clock cycles. But the overall saving on a system with an instruction cache would be worthwhile. 16 bit Thumb instructions make the saving less pronounced but having additional lengths (and no additional escape bits in total) means that a trivial encoding in BER format is more efficient than Thumb encoding. If fields are arranged so that the least frequent encodings occur at one end then the efficiency gain can be amplified. BER format optionally allows longer instructions and therefore it becomes relatively trivial to reach or exceed the peak 40 bit per clock cycle instruction decode of an Intel Xeon.

So, any conventional 3-address RISC instruction set with suitably arranged fields may be a practical implementation with practical code density and practical execution speed. The trick is to not screw it up. Don't do anything too radical. Indeed, use of BER instruction packing allows other changes to be de-coupled with confidence. Knowing that the optimal instruction has not been chosen, it remains possible to implement a virtual processor, implement an assembler, implement a trivial compiler, implement an optimized compiler and then improve the instruction format without creating or modifying any compiler. This can be achieved by making cursory changes to the virtual machine to obtain statistics. Then changes can be made to the assembler and virtual machine to accommodate changes to instruction encode and instruction decode.

Some inefficiency may be beneficial for some implementations. Split fields are of minimal consequence for a hardware implementation or FPGA but may be very awkward for a general purpose computer. Likewise, some opcode arrangements may be beneficial for GPU but awkward for FPGA. In particular, omission of a flag register is very beneficial for software implementation but detrimental to composability of instructions and therefore detrimental to code density. However, use of conditional prefix instructions is also detrimental. They require implicit state to be held for one instruction cycle. This adversely affects interrupt response. Hidden state complicates a design. In this case, hidden flags cannot be dumped anywhere so the instruction sequence must be executed atomically.

There is the less immediate concern of running a 32 bit and/or 64 bit virtual processor on more modest hardware. This is a classic mainframe technique. Motorola, Tandem and others had less success replicating this on cheaper hardware but there are minimal complaints about Atmel AVR support for 32 bit numbers and 64 bit numbers on processors with 8 bit registers and 16 bit registers. I'm one of the worst offenders but it isn't that bad and I'm a relatively rare corner case. I've made extensive effort to optimize code size and execution speed across common architectures. Sometimes these objectives align and sometimes they don't. In addition to that, I've made extensive effort to avoid Arduino's patronising interface and instead use GNU make for compilation across Atmel SAM ARM and Atmel AVR. This is mostly a publicly available script plus compiler flags, configuration settings taken from previous work and archiving taken from previous work. Unfortunately, it drops the AVR library routine for arbitrary modulo. I presume division is similarly affected. Most people don't notice how this is implemented. Powers of two are reduced to bit shifts and bit masks but it is an irritation to not have, for example, modulo 10 and that's why I haven't published it.

For many applications, such as process control, a virtual processor would be indistinguishable from current practice while offering advantages. A 40MHz, 16 bit micro-controller is considered to be slow and under-powered. Regardless, many applications, such as hydroponic pump control, only require response within 15 seconds. I presume that systems such as chemical mixing and gas pipelines have similar bounds. Faster is better but reliable is best. If it is possible to increase reliability by a factor of 10 but it reduces speed by a factor of 1000 then there are circumstances where it is very worthwhile to trade a surplus resource.

My Ideal Processor, Part 8

Posted by cafebabe on Tuesday January 16 2018, @08:06PM (#2928)
0 Comments
Hardware

(This is the 59th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

I defined a provisional instruction set. The final instruction set will be defined after empirical testing. Fields may be split, moved, removed and/or expanded. Opcodes may be re-arranged. However, there is enough to begin implementation and testing.

The basic concept for implementing a virtual machine is a set of case statements within a while loop. This implements the fetch-execute cycle. One instruction is fetched at the beginning of the while loop and it is decoded via a switch statement (or nested switch statements) and then each case implements an instruction or related functionality. Instructions can be as abstract as desired and may include "make robot walk", "set light to half brightness" or "fetch URL". In my case, I'm trying to define something which could be implemented as a real micro-processor. Even here, people have implemented hardware with instructions such as "calculate factorial", "find string length", "copy string" or "add arbitrary precision decimal number". Even this is too complicated. I'm intending to implement a virtual machine which would compare favorably with hardware from 25 years ago. This is partly for my sanity given how computers have become encapsulated to the extent that almost no-one understands how a serial stream is negotiated or encoded over USB. There is also the security consideration. I cannot secure my desktop. I certainly cannot secure a larger system. Anyone who claims that they can secure a network is ignorant or a liar. I'm not even sure that my storage or network interface run their supplied firmware.
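
As a sketch of that basic shape (the opcodes here are deliberately toy examples, including one abstract instruction, and bear no relation to the provisional instruction set):

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_HALT, OP_ADD_IMM, OP_PRINT, OP_LIGHT_HALF };

    static void run(const uint8_t *mem)
    {
        uint16_t pc = 0;
        int32_t acc = 0;

        for (;;) {
            uint8_t op = mem[pc++];          /* fetch */
            switch (op) {                    /* decode */
            case OP_ADD_IMM:                 /* execute: add the next byte */
                acc += mem[pc++];
                break;
            case OP_PRINT:
                printf("%d\n", acc);
                break;
            case OP_LIGHT_HALF:              /* "set light to half brightness" */
                /* drive a PWM peripheral here */
                break;
            case OP_HALT:
            default:                         /* bogus opcode: stop */
                return;
            }
        }
    }

    int main(void)
    {
        const uint8_t program[] = { OP_ADD_IMM, 40, OP_ADD_IMM, 2,
                                    OP_PRINT, OP_HALT };
        run(program);                        /* prints 42 */
        return 0;
    }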

I'd like something which targets Arduino, FPGA, GPU or credit card size computer (or smaller). We've got quad-core watches and a watch with RAID is imminent. With this level of miniaturization, we can apply mainframe reliability techniques to micro-controllers. For example, we can run computation twice and check results; insert idempotent, hot-failover checkpoints between all observable changes of micro-controller state; or implement parity checking when it is not available on target hardware. These techniques have obvious limitations. However, embedded systems often have surplus processing power. When EPROM cost more than a week's wages, micro-controllers would decode Huffman compressed Forth bytecode or similar. Now we can use the surplus to increase reliability. The alternative is too awful to contemplate.
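
For example, the run-twice technique might look like the following sketch. pump_duty() and fail_safe() are hypothetical and a real implementation would repeat the computation from checkpointed state:

    #include <stdint.h>
    #include <stdlib.h>

    /* the real computation; the volatile input discourages the compiler
     * from collapsing the two calls into one */
    static uint16_t pump_duty(volatile const uint16_t *temp)
    {
        return *temp / 4 + 10;
    }

    static void fail_safe(void)
    {
        abort();   /* placeholder: latch relays off and raise an alarm */
    }

    uint16_t checked_pump_duty(volatile const uint16_t *temp)
    {
        uint16_t first  = pump_duty(temp);   /* run computation twice ... */
        uint16_t second = pump_duty(temp);
        if (first != second)                 /* ... and check results */
            fail_safe();
        return first;
    }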

It is possible to have standardized object code which resumes on replacement hardware. This is like VMware for micro-controllers. In a trivial case, an LCD panel may run a clock application. The clock application checkpoints to a house server. When the panel fails, it would be possible to purchase a replacement panel and restore the application on the new panel. It may now run on a slower system with a smaller display but within a minute or so, it should find its time source, adjust to the new display size and otherwise display time to your preferences.

A more practical example would be a hydroponic controller. People are developing I/O libraries which allow relay control over Ethernet with minimal authentication, no error checks (between devices with no memory integrity) and no hardware interlocks. For your own safety, please don't do this. A more sensible approach is to run two instances of the firmware. One instance runs locally in a harsh environment where humidity may reach 100% and temperature may fall below 0°C. The other instance runs in a controlled environment which is always dry and at room temperature. Both instances run integrity checks but no relays get triggered unless both instances compute the same results. Alternatively, the instance local to the relays may continue with less oversight while the server instance sends alerts about the lack of monitoring. For a large environment, it is possible to use the standard database technique of a double or triple commit to ensure that servers have a consistent state prior to server hot-failover. This can work in conjunction with console access, graphical remoting and centralized OLAP over low-bandwidth networks.

Sun Microsystems said "The network is the computer." and then gave us Java. This only provided hot-failover within the Enterprise Java Bean environment. I'm proposing a system where low-bandwidth process control runs in two or more places, has the convenience of Android, the reliability of an IBM mainframe, the openness of p-code and the security of erm, erm. That part has been lacking for quite a while.

My Ideal Processor, Part 7

Posted by cafebabe on Tuesday January 16 2018, @08:04PM (#2927)
0 Comments
Hardware

(This is the 58th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

I had some time to kill in a hospital so I took some paper and solved a problem which has been annoying me for more than three years. I wanted to implement an instruction set with good code density while maintaining an efficient hardware implementation. This requires the presentation of one or more useful instruction fields in the approximate frequency in which they are required. It also requires using the majority of bit permutations within each field. Presenting useful fields is trivial. Popular processor architectures like IBM 360, DEC VAX and Intel x86 make instruction length implicit as part of an instruction. RISC processor architectures insist on one size of instruction (usually with exceptions for constants). ARM Thumb re-uses a (rarely used) conditional field as a method of compacting many of its 32 bit instructions to 16 bit representations. However, the Xtensa processor architecture shows that good code density can be achieved by straddling a golden ratio with 2 byte instructions and 3 byte instructions where the length is implicit to the instruction.

I've found that similar results can be achieved by making instruction size explicit rather than implicit. Specifically, through the use of BER. With this encoding, bytes are split into a one bit field and a seven bit field. The top bit indicates if another byte follows. This allows numeric values of any size to be encoded in multiples of seven bits. In particular, it allows processor instructions to be encoded as 7 bits, 14 bits, 21 bits, 28 bits or more. It is also possible to choose representations other than octets or where escape bits are skipped. So, for example, 2 byte, 3 byte and 5 byte encoding would only require two or three escape bits among 37 bits or significantly more. This allows detailed instructions to be defined while programs match or exceed the code density of AVR or ARM. Explicit escapes appear to reduce code density but they provide more options for compaction while simplifying instruction decode.
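
As a minimal sketch of this packing, with the least significant seven bits first (the byte order is my assumption; the articles don't mandate one):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* pack a value as seven payload bits per byte, top bit set while
     * more bytes follow; returns the byte count */
    static size_t ber_encode(uint32_t value, uint8_t *out)
    {
        size_t n = 0;
        do {
            uint8_t byte = value & 0x7f;
            value >>= 7;
            out[n++] = byte | (value ? 0x80 : 0x00);
        } while (value);
        return n;            /* a 28 bit instruction packs into 1 to 4 bytes */
    }

    static size_t ber_decode(const uint8_t *in, uint32_t *value)
    {
        uint32_t v = 0;
        unsigned shift = 0;
        size_t n = 0;
        do {
            v |= (uint32_t)(in[n] & 0x7f) << shift;
            shift += 7;
        } while (in[n++] & 0x80);
        *value = v;
        return n;
    }

    int main(void)
    {
        uint8_t buf[5];
        uint32_t v;
        size_t n = ber_encode(0x1234, buf);      /* 13 bits: packs into 2 bytes */
        ber_decode(buf, &v);
        printf("%zu bytes, value %#x\n", n, v);  /* prints "2 bytes, value 0x1234" */
        return 0;
    }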

Within an unpacked instruction, fields can be arranged (or split) to minimize average instruction length. Where fancy functionality is rarely used, such as ARM's pre-scaler, the default bit permutation of zero requires no space when packed as BER. Another trick, most useful within a 3-address machine, is that one reference to a register may be relative to another. This allows compact representation for common cases, such as 2-address instructions. An example would be g+=h which can also be written as g=g+h or g=h+g. An instruction which references g more than once may use zero in subsequent references. Obviously, this requires all references to be skewed but this can be achieved with XOR or addition such that common cases have zero in desired fields. This allows an instruction to be compacted more efficiently.
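
A sketch of the skew, assuming XOR against the destination field (whether XOR or addition is used, and which field anchors the skew, are implementation choices):

    #include <stdint.h>
    #include <stdio.h>

    /* a register reference is stored XORed against the destination so
     * that re-use of the destination packs as zero under BER */
    static uint8_t skew_encode(uint8_t reg, uint8_t dst)   { return (reg ^ dst) & 7; }
    static uint8_t skew_decode(uint8_t field, uint8_t dst) { return (field ^ dst) & 7; }

    int main(void)
    {
        /* g += h with g = R3 and h = R5, written as g = g + h */
        uint8_t dst  = 3;
        uint8_t src1 = skew_encode(3, dst);      /* 0: the repeated reference */
        uint8_t src2 = skew_encode(5, dst);      /* 6 */
        printf("src1=%u src2=%u src1 decodes to R%u\n",
               src1, src2, skew_decode(src1, dst));
        return 0;
    }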

A worked example would be 5 bits for opcode, 2 bits for addressing mode and 3 bits for each of three register references. Ordinarily, this would require 16 bits. I've found 16 bit encoding to be quite restrictive. The conventional approach is an implicit sequence of additional 16 bit words. However, with BER and skewed register references, 2-address instructions only require 13 bits. This fits into 2 bytes while retaining fast decode of significantly longer instructions. In practice, instruction size is similar to the x86 VEX prefix scheme but implemented cleanly rather than being bolt-on cruft. This arrangement also satisfies the criteria of an instruction-space as a binary fraction where there is a 1 byte breakpoint instruction and numerous useful instructions with 1 byte and 2 byte encodings. Although there is no explicit bias within a set of general registers, if operand references cross encoded byte boundaries then use of some registers leads to a more compact representation. This runs faster - even within a virtual machine. This provides opportunities to tweak code efficiency within a conceptually simple model but it isn't essential.

This was all known to me but the piece that had been annoying me for more than three years was instruction set representation of a fully functional processor. It is easy to take specific cases, like g=h-i, or common optimizations, such as a-=c, but it is more difficult to devise an ALU, stack and main memory interface with realistic code density. A designer must ensure that there is a practical set of instructions which are sufficiently over the break-even point. A designer also has to ensure that bit patterns are only allocated to one task. When I worked on a rigidly fixed 16 bit instruction set in which opcode could appear within each of four fields, I had numerous false successes in which the same bit patterns represented two or more operations. A more conventional division of opcode and operands reduces this problem but leads to the opposite problem of underwhelming allocation. This may occur directly where a common instruction requires a long encoding or indirectly where idioms require an inordinately long sequence of instructions. There are two examples here.

x86 is generally regarded as achieving the best code density among widely deployed systems. The method by which this is achieved is questionable market economics. It is also achieved by re-cycling obsolete bit patterns. For example, the INT3 breakpoint instruction. Or 0x0F, which was a 16 bit stack operation (POP CS) and is now a prefix for MMX and suchlike. However, the legacy 16 bit mode has a very curious wrinkle. 20 bit addressing (1MB) is achieved by pre-scaling 16 bit segment pointers by four bits. In 2018, an armchair critic can easily suggest that pre-scaling by eight bits would have given a 24 bit address-space (16MB). However, this was devised in an era when systems shipped with 1KB RAM or less. (The 1976 pre-release documentation for the 6502 describes implementation of systems with 256 bytes or less.) Wasting up to 15 bytes between each of four segments was considered positively profligate. However, the astounding part comes from the rare case when pointers between segments are tested for equality. In this case, Intel's macro assembler generated a 44 byte sequence. That's excessive.

We've mostly dumped that historical baggage but the problems of addition and multiplication remain. This is the most pertinent example. When implementing multiplication, typically upon signed, two's complement binary integers, there are several choices:-

  1. Signed multiplication only.
  2. Unsigned multiplication only.
  3. Two variants. Both inputs signed or unsigned.
  4. Four variants. Any input signed or unsigned.
  5. Three variants. This is from the observation that multiplication is commutative. However, the addressing modes available on operands may be:-
    1. Restricted to registers.
    2. Either operand may source data from memory.
    3. Both operands may source data from memory.

There is also the consideration that multiplying m bits × n bits requires approximately m+n bits of answer to avoid overflow. This leads to choices:-

  1. Store result in two specified registers.
  2. Store result in two sequential registers.
  3. Store result in two registers or one memory location.
  4. Store result in two registers or two memory locations.
  5. Allow overflow.
  6. Restrict input to m/2 bits to preclude overflow.

There are also options where one or more pieces of data are discarded. For example, by "writing" to a null register. This may be dedicated constant register or it may be a register which is (otherwise) unremarkable. There are a similar set of choices for the quotient and remainder of integer division.

The initial release of the MC68000 provided MULS and MULU where all inputs and outputs could be register or memory. However, for very sound reasons, 32 bit arithmetic was initially implemented as two rounds of 16 bit operations and this led to the very obvious design decision to only implement 16 bit multiplication. Despite the unsigned capability and flexible addressing modes, the width limitation was deeply unpopular and led many people to falsely regard the entire range of MC680x0 processors as being 16 bit or only fully compatible with 16 bit implementations.

In RISC design, unsigned multiplication appears to be the obvious choice on the basis that, for example, 32 bit × 32 bit unsigned multiplication can be used to implement 31 bit × 31 bit signed multiplication. (Where do the sign bits go? They get held implicitly in the program counter when the signed multiplication macro executes one of four paths for positive and negative inputs.)
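
In C, that four-path macro might look like the following sketch, where umul32() stands in for the hardware's only multiply:

    #include <stdint.h>
    #include <stdio.h>

    /* the only multiply the hardware is assumed to provide */
    static uint64_t umul32(uint32_t a, uint32_t b)
    {
        return (uint64_t)a * b;
    }

    /* one of four paths executes; the two sign bits live only in the
     * choice of path, i.e. in the program counter; magnitudes are
     * limited to 31 bits so the negations below cannot overflow */
    static int64_t smul31(int32_t a, int32_t b)
    {
        if (a >= 0 && b >= 0) return  (int64_t)umul32((uint32_t)a,  (uint32_t)b);
        if (a >= 0)           return -(int64_t)umul32((uint32_t)a,  (uint32_t)-b);
        if (b >= 0)           return -(int64_t)umul32((uint32_t)-a, (uint32_t)b);
        return                        (int64_t)umul32((uint32_t)-a, (uint32_t)-b);
    }

    int main(void)
    {
        printf("%lld\n", (long long)smul31(-3, 7));   /* prints -21 */
        return 0;
    }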

My personal preference is unsigned multiplication with restricted input range and no integer division because this provides the most implementation options and the most upward compatibility while the minimal implementation requires the fewest gates and the least energy consumption. However, this is within the context of a design which (provisionally) may source one aligned operand from memory and pre-scales forward branches to cache-line boundaries. (Nominally, 64 bytes.) So, synthesizing signed multiply requires execution paths which form a two tier binary tree. This requires four cache-lines. Each execution path requires two conditional branches (and possible execution pipeline stalls) or backward branches to reduce the number of cache-lines required. The most stream-lined version requires a temporary register and 256 bytes of instruction-space. This provides less functionality than a MC68000 while incurring far more clock cycles than most RISC designs. That's unworkable.

We have to make a cut somewhere. We either have multiple memory access cycles, two inputs, two outputs and special 4-address multiplication instructions or some functionality must be discarded.

So, I'll try again. No integer division, signed multiplication with unrestricted input range and no handling of overflow. This allows all variants to be synthesized without branching but each unsigned input over the full range doubles the number of multiplications required. Temporary registers may also be required to assemble the pieces. Overhead can be reduced with moderate abuse of type casting. For example, it is possible to multiply and accumulate 20 bit unsigned × 8 bit unsigned in one instruction (and possibly one clock cycle). The design is otherwise fairly agnostic about signed and unsigned data. At worst, comparison of unsigned integers can be synthesized from signed comparison via use of two temporary registers and zero extra branches.
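
The unsigned comparison trick at the end of that paragraph deserves a sketch: flipping the sign bit of both operands (the two temporary registers, no branches) maps unsigned order onto signed order, so the signed comparison instruction can be reused. The exact instruction sequence is my assumption:

    #include <stdint.h>
    #include <stdio.h>

    static int unsigned_less(uint32_t a, uint32_t b)
    {
        int32_t ta = (int32_t)(a ^ 0x80000000u);   /* temporary 1 */
        int32_t tb = (int32_t)(b ^ 0x80000000u);   /* temporary 2 */
        return ta < tb;                            /* signed comparison */
    }

    int main(void)
    {
        printf("%d\n", unsigned_less(1u, 0xffffffffu));   /* prints 1 */
        return 0;
    }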

However, after resolving this ALU functionality (and previously resolving FPU and SIMD functionality), stack and memory addressing modes are completely unresolved. My preference is bit addressable memory or possibly nybble addressable memory but anything smaller than bytes would be completely incompatible with the vast majority of RAM. Bit addressing also bloats constants and my designs are relatively bad in this regard. Bit addressable offsets don't affect pre-decrement or post-increment but they fail horribly when accessing an element within a structure or when skipping through an array of fixed length structures. From this, a friend made the observation that a stack only has to be aligned to the width of a general register and stack offsets can be implemented portably for any width of register. Specifically, 32 bit code can work with a 64 bit stack while encouraging good code density.

However, a trend popularized by VAX, MIPS, ARM and other processor architectures has been to steal two of the general purpose registers and replace them with a call stack and program counter. This allows MOV to PC to implement branch and post-increment to implement stack pop. (My preference for a zero register allows OR with zero to implement MOV while allowing indirect addressing modes to access a stack hidden under the zero register.) To maintain historical compatibility, performance and code density, some addressing modes don't make sense and some are detrimental. In particular, program counter relative addressing emerges from a lack of position independent code and a failure to separate code and data. While it is possible to define meaningful semantics for the majority of bit permutations, this requires CISC micro-code or extra decode logic. This slows execution. Addressing modes can be offloaded to dedicated load and store instructions but this messes with code density more than ALU throughput. There is also the observation that x86 allows the maximum tricksiness of unaligned, little-endian access. This covers atomic, memory-to-memory conditional swap operations and, in most cases, doesn't cause the Turing complete memory interface to deadlock.

While it has never been advisable to store a 2 byte pointer on an odd memory boundary, use of RISC compatible, open source compilers has allowed aligned storage to become the norm. From Intel Core processors onwards, unaligned access is implemented with a penalty, without lawyers suing for deceptive practice and, in most cases, performs read and write to the requested memory locations. (See the Intel Core errata for reasons why NASA uses older Intel processors.)

Among schemes which work correctly, the trend has been to break historical instructions into micro-ops which make one memory access. Where access crosses a word, cache or page boundary, two accesses may require two micro-ops. For new designs, opcodes may source one aligned piece of data while performing computation. Dedicated instructions may also load or store aligned data in one instruction or unaligned data in two instructions. For efficiency, code density and historical compatibility, these instructions may implicitly set atomic locks and/or perform post-increment or suchlike. Intel's Itanium only offers pre-decrement in similar circumstances. This may be more useful or may be related to other details of the processor architecture. Regardless, it is trivial to make a register which is also a ripple counter and it is trivial to turn an incrementing counter into a decrementing counter by inverting the inputs and outputs. However, it is more cumbersome to make a latch which increments and decrements in a range of steps. Therefore, asymmetry, such as pre-decrement only, is occasionally seen.

I'd resolved much of this already. I just had to be certain that memory-to-memory addressing, such as the MC68000's MOVE (A0)+,(A1)+, is not a template to follow slavishly. Indeed, a suitable RISC design allows optional unaligned access which exceeds the historical functionality without slowing the common case.

In the spirit of Fred Brooks Junior ("Show me your data structures and I shall be enlightened. Show me flowcharts and I shall remain mystified."), for eight registers, three stacks (system, USR1, USR2) and a stack of program counters, instructions have:-

  • 1 accumulate bit.
  • 1 float bit.
  • 3 other opcode bits.
  • 2 bits for data path width where:-
    • 00: byte.
    • 01: word.
    • 10: long.
    • 11: double.
  • 3 bit destination operand.
  • 3 bit source operand.
  • Another 3 bit source operand.
  • 3 bit pre-scale.
  • 7 bit immediate or offset value.

The first 5 bits define opcodes in groups of four such that:-

  1. Integer multiply.
  2. Integer multiply with accumulate.
  3. Floating point multiply.
  4. Floating point multiply with accumulate.
  5. Integer subtraction.
  6. Integer subtraction with accumulate.
  7. Floating point subtraction.
  8. Floating point subtraction with accumulate.
  9. Unused.
  10. Unused.
  11. Floating point divide.
  12. Floating point divide with accumulate.
  13. Aligned load and store instructions.
  14. Unaligned load and store instructions.
  15. USR1 push and pop instructions.
  16. USR2 push and pop instructions.
  17. Arithmetic shift right.
  18. Arithmetic shift right with accumulate.
  19. Trigonometric functions.
  20. Trigonometric functions with accumulate.
  21. Arithmetic shift left.
  22. Arithmetic shift left with accumulate.
  23. Hyperbolic functions.
  24. Hyperbolic functions with accumulate.
  25. Unused.
  26. Unused.
  27. Roots and rounding.
  28. Roots and rounding with accumulate.
  29. Flow control.
  30. Flow control.
  31. Unused.
  32. Unused.

An ALU or FPU instruction performs one of the following (sketched in C after the list):-

  1. a = b op (imm << pre)
  2. a = b op ((imm + c) << pre)
  3. a = b op (([imm + c]) << pre)
  4. a = b op (([imm + c++]) << pre)
  5. a += b op (imm << pre)
  6. a += b op ((imm + c) << pre)
  7. a += b op (([imm + c]) << pre)
  8. a += b op (([imm + c++]) << pre)
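
A sketch of the eight forms with op fixed to integer addition; reg[], mem[] and the parameter names are illustrative and bound checks are omitted:

    #include <stdint.h>

    static uint64_t reg[8];
    static uint64_t mem[4096];

    /* addressing selects the operand source (immediate, register,
     * memory, memory with post-increment); accumulate selects = or += */
    static void alu_add(int a, int b, int c, uint64_t imm, unsigned pre,
                        int addressing, int accumulate)
    {
        uint64_t operand = 0;

        switch (addressing) {
        case 0: operand = imm << pre;                 break;  /* forms 1, 5 */
        case 1: operand = (imm + reg[c]) << pre;      break;  /* forms 2, 6 */
        case 2: operand = mem[imm + reg[c]] << pre;   break;  /* forms 3, 7 */
        case 3: operand = mem[imm + reg[c]++] << pre; break;  /* forms 4, 8 */
        }
        if (accumulate)
            reg[a] += reg[b] + operand;   /* a += b op ... */
        else
            reg[a]  = reg[b] + operand;   /* a  = b op ... */
    }

    int main(void)
    {
        reg[1] = 40;
        alu_add(0, 1, 0, 2, 0, 0, 0);   /* form 1: R0 = R1 + (2 << 0) */
        return (int)reg[0] - 42;        /* exit status 0 when R0 == 42 */
    }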

You may notice several omissions from the instruction set. For example, as is often defined in C, logical shifts can be derived from arithmetic shifts. Also, OR with zero provides register-to-register MOV. Most importantly, where is addition? There is no 3-address addition which can source from memory. However, register-to-register multiplication is commutative and therefore half of its encodings are duplicates; some addition can be hidden in those spare bit patterns of the multiply instruction. Specifically:-

  • a = b + ((imm + c) << pre)
  • a += b + ((imm + c) << pre)

It is useful to implement:-

  • a = 0
  • a = const
  • a += const
  • a -= const
  • a += a
  • a = b + c
  • a = b * b
  • a += b * b

Of these, handling constants is the most awkward. However, it may be possible to amortize this difficulty if constants are sourced from stack. Fixed quantity left shifts are provided with all ALU operations. With some byte manipulation, this also provides fixed quantity right shifts. See definition of ntohs() for example. (This idiom may be replaced with byte swap or word swap, if available.) Unused bit patterns may be used for AND, OR and XOR. Division and modulo which cannot be implemented as bit operations are implemented with a subroutine. This is standard practice on AVR and ARM. At present, there are no instructions to cast between integers and floating point. However, immediate values are cast appropriately and this may form the basis of conversion subroutines. Epsilon values over a large range can be obtained with use of pre-scaler and floating point division. "Branch never" means call subroutine. "Branch never nowhere" means return from subroutine.
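
For illustration, the derived idioms (logical shift from arithmetic shift, as mentioned in the omissions paragraph above, and MOV as OR with the zero register) might be sketched as follows, assuming the common C behavior of arithmetic >> on signed values:

    #include <stdint.h>
    #include <stdio.h>

    /* logical shift right via the arithmetic shift: shift, then mask
     * off the smeared sign bits; valid for n in 1..31 */
    static int32_t lsr(int32_t x, unsigned n)
    {
        return (x >> n) & (int32_t)(0x7fffffffu >> (n - 1));
    }

    /* register-to-register MOV as OR with a hardwired zero register */
    static int32_t mov_via_or(int32_t zero_reg, int32_t src)
    {
        return zero_reg | src;   /* zero_reg is always 0 */
    }

    int main(void)
    {
        printf("%08x\n", (unsigned)lsr(-16, 2));   /* prints 3ffffffc */
        printf("%d\n", mov_via_or(0, 42));         /* prints 42 */
        return 0;
    }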

Although the instruction set is not completely orthogonal, omissions can be written as unconditional sequences of instructions which may use temporary variables. These local variables compete for register allocation or stack allocation like any other local variable. Ignoring the asymmetry of register-to-register addition with pre-scale, 2^r registers are completely orthogonal. Register allocation can be shuffled with impunity. However, program length may differ slightly when packed encodings change. When combined with function inlining and a callee saves, stack only call convention, register allocation has no dependencies between functions.

Happy Martin Luther King Day!!!!

Posted by realDonaldTrump on Monday January 15 2018, @11:46PM (#2924)
5 Comments
Topics

Martin Luther King, Jr., was the very model of a modern American patriot.

The Reverend’s devotion to fighting the injustice of segregation and discrimination ignited the American spirit of fraternity and reminded us of our higher purpose.

Dr. King's dream is my dream. It is the American Dream. It's the promise stitched into the fabric of our Nation, etched into the hearts of our people, and written into the soul of humankind. AMERICA FIRST! 45.wh.gov/NhJy55

The Very Finest Flame

Posted by MichaelDavidCrawford on Monday January 15 2018, @08:56PM (#2922)
9 Comments
Career & Education

From https://github.com/atnan/SMJobBlessXPC/issues/7:

Hey Rudy,

It’s refreshing to be proven correct about the nature of your issue. Because I’ve recently discovered a tiny amount of fondness for you (probably brought about by the way you deleted your original posts before responding,) I’m going to try to help you with that problem right here, in this email.

I thought my assessment of your 2 posts was spot on. You can probably go a long way in this world without my help, but for all you know, I might be your current CTO. Or, I may be your next customer. Or in a position to hurt or help your career at some future point.

I’ve learned a lot of lessons in my career in IT. The most important one that I’d like to share with you right now, is “Don’t shit where you eat.”

You act like an immature, testosterone-laden hotshot (maybe), who hasn’t yet learned how to not to behave like a douchebag in public venues.
Or that all actions have consequences.

If I were you, I’d try to get those lessons under your belt sooner, rather than later. You’ll likely find your path through life will be smoother and more secure without an array of brightly burning bridges in your wake.

Cheers!

There is some good news out of this: Rudy apologized later in the thread.

I made a new park & holiday!

Posted by realDonaldTrump on Sunday January 14 2018, @02:55AM (#2920)
1 Comment
Topics

Good news for #MLK fans -- yesterday, it was my great honor to proclaim January 15, 2018, as Martin Luther King Jr., Federal Holiday. That's right, I proclaimed a holiday just for him! Because I think he was a great guy, a great American hero. We have some fine people from Norway, but Martin was one of our own. I encourage all Americans to observe this day with appropriate civic, community, and service activities in honor of Dr. King's EXTRAORDINARY life and his great legacy. pic.twitter.com/samlJsz1Nt

And on Monday it was my great pleasure & honor to sign H.R. 267, the "Martin Luther King, Jr. National Historical Park Act," which redesignates the Martin Luther King, Junior, National Historic Site in the State of Georgia as the Martin Luther King, Jr. National Historical Park. 45.wh.gov/T5ByM5 pic.twitter.com/QTgaqTawPT

God bless you all and God bless America!

NASA Kilopower News Conference on Jan. 18

Posted by takyon on Thursday January 11 2018, @01:31AM (#2916)
2 Comments
Science

NASA, Partners Discuss Power for Future Space Exploration

NASA and its partners will host a news conference at noon EST (9 a.m. PST) Thursday, Jan. 18, at the National Atomic Testing Museum in Las Vegas, to discuss a recent experiment involving a new power source that could provide the safe, efficient and plentiful energy needed for future robotic and human space exploration missions.

Audio of the news conference and presentation slides will stream live on NASA’s website.

Representatives from NASA, the National Nuclear Security Administration’s (NNSA’s) Los Alamos National Laboratory and Nevada National Security Site (NNSS) will discuss and take questions on the Kilopower project, which aims to demonstrate space fission power systems technology that has the potential to enable future crewed surface missions to the Moon, Mars and beyond. Testing began in November 2017 and is expected to continue through March.

Previously: NASA's Kilopower Project Testing a Nuclear Stirling Engine

Good news, folks! I ended global warming! ❄🌎❄

Posted by realDonaldTrump on Wednesday January 10 2018, @06:34AM (#2914)
12 Comments
Topics

Folks, President Obama is a big believer in global warming. So he told his people, "look for the global warming, tell everyone about the global warming." And every year they said "last year was the warmest year ever, this is going to be even warmer." But I'm open-minded. I got rid of a lot of Obama FLUNKIES & FLOOZIES. And the folks that are left, I said to them, "global warming is a hoax, maybe, probably, it's a hoax, just tell us about the weather!" So we have a beautiful weather report from my people at NOAA. We had some amazing weather last year in the great states of Florida, Texas, California, Montana, North and South Dakota. Erie, Pennsylvania. And Puerto Rico, which isn't a state. It's an island.

But guess what, it wasn't the warmest year. 2017 wasn't the warmest year ever. Because 2012 and 2014 -- when Obama was in office -- were warmer. I'll tell you, I never promised to end global warming. But I did it. And I did it while I BROUGHT BACK THE JOBS in our terrific coal, oil & gas industries. Like I promised.

And now that we've taken care of global warming, we're going to repeal Obama's Clean Power Plan. Because our power is clean enough. If we make it any cleaner, we'll kill our electric. Our Christmas trees will go dark. And our chandeliers. It won't be pretty, folks. In Africa they like it dark, they can stay in the dark. We're not doing that. We're doing a REPEAL. Administrator Pruitt, very smart guy, is doing the paperwork for that. To Make America Great Again.

And if we get a little global warming, we can use it. We could really use it. Our East, our South and our Central USA were cold as hell last week. Pipes freezing & bursting in Jackson, Mississippi and in Atlanta! And a lot of people freezing to death. We could use some goddamned global warming!

Chinese Witches

Posted by takyon on Tuesday January 09 2018, @01:55AM (#2913)
4 Comments
/dev/random

In rural China, calling someone a 'witch' has serious social consequences

Population structured by witchcraft beliefs (DOI: 10.1038/s41562-017-0271-6) (DX)

Anthropologists have long argued that fear of victimization through witchcraft accusations promotes cooperation in small-scale societies. Others have argued that witchcraft beliefs undermine trust and therefore reduce social cohesion. However, there are very few, if any, quantified empirical examples demonstrating how witchcraft labels can structure cooperation in real human communities. Here we show a case from a farming community in China where people labelled zhu were thought capable of supernatural activity, particularly poisoning food. The label was usually applied to adult women heads of household and often inherited down the female line. We found that those in zhu households were less likely to give or receive gifts or farm help to or from non-zhu households; nor did they have sexual partnerships or children with those in non-zhu households. However, those in zhu households did preferentially help and reproduce with each other. Although the tag is common knowledge to other villagers and used in cooperative and reproductive partner choice, we found no evidence that this assortment was based on cooperativeness or quality. We favour the explanation that stigmatization originally arose as a mechanism to harm female competitors. Once established, fear that the trait is transmissible may help explain the persistence of this deep-rooted cultural belief.