(This is the 49th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
ARM is a mess. Skip the details if they get too complicated. When I see "ARM Cortex", I translate this as "ARM, with the interrupt controller which is incompatible with every other variant of ARM". Indeed, when we say ARM, is that ARM with the 26 bit address space and dedicated zero register, ARM with the 32 bit address space (with or without Java hardware, with or without one of the Thumb modes, with or without one of the SIMD variants) or ARM with 64 bit registers? And that's before we get to ARM licensee variants. So, for example, Atmel ARM processors are configured very much like micro-controllers. I/O pins may be set in groups of 32 to be digital inputs or digital outputs (with optional integral pull-up or pull-down resistors) or over-ridden to provide other functions. Broadcom's BCM2835, used in some models of Raspberry Pi, has banks of registers where I/O pins are configured in groups of 10 and each pin can potentially be set to one of eight functions (three bits per pin, so 30 bits of each register are used). I presume ARM processors from NXP have something completely different.
One of the most consistent aspects of ARM Cortex is that every manufacturer seems to offer 16 interrupt levels. No more. No less. In this regard, ARM Cortex is very similar to a TMS9900. This was a heavy-weight 3MHz mini-computer core which, entertainingly, was also the core of the Speak & Spell, as used by E.T. to phone home. This also has similarities to the RCA1802 which is most famous for being used in satellites.
These designs come from a very different age where open-reel tape (digital and analog) was the norm and bubble memory was just around the corner. Can you imagine a system with 2KB RAM and 4MB of solid state storage? That almost happened. Indeed, retro-future fiction based on this scenario would be fantastic.
The RCA1802 architecture is of particular interest. They hadn't quite got the idioms for subroutines. Instead, there was a four bit field to determine which of the 16 × 16 bit general purpose registers would be the program counter. Calling a subroutine involved loading the subroutine address into a register and setting the field so that the register became the new program counter; execution then continued from the specified address. Return was implemented by setting the field back to its previous value. This has some merit because leaf subroutines typically use fewer registers, so it isn't critical if general purpose registers become progressively filled with return addresses. Indeed, if you want to sort 12 × 16 bit values, an RCA1802 does the job better than a 68000 or an ARM processor in Thumb mode. However, in general, every subroutine has to know the correct bit pattern of the four bit field for successful return. This creates a strict hierarchy among subroutines and, in some cases, requires trampolines.
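To make the mechanism concrete, here is a minimal C sketch of the register model; the structure and names are mine, not RCA's, but the behaviour of the four bit field matches the description above.

```c
#include <stdint.h>

/* Sketch of the RCA1802 register model. Names are illustrative. */
typedef struct {
    uint16_t r[16]; /* 16 x 16 bit general purpose registers */
    uint8_t  p;     /* four bit field: which register is the program counter */
} Cpu1802;

/* The 1802's SEP instruction: register n becomes the program counter.
 * A "call" loads a subroutine address into r[n] and executes SEP n.
 * A "return" is the subroutine executing SEP with the caller's old P
 * value, which is why every subroutine must know its caller's P field. */
static void sep(Cpu1802 *cpu, unsigned n) {
    cpu->p = (uint8_t)(n & 0x0F);
}
```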
The TMS9900 architecture does many things in multiples of 16. It has a great instruction set which uses the 16 bit instruction space efficiently. It has 16 interrupt levels. And it keeps context switching fast by having 16 × 16 bit registers in main memory, accessed via a workspace pointer. Indeed, there are aspects of this design which seem to have influenced early iterations of SPARC and ARM.
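As a sketch, with illustrative names and a flat RAM array, a TMS9900-style context switch is nothing more than a pointer swap:

```c
#include <stdint.h>

/* Sketch of TMS9900-style workspace registers. The registers live in RAM
 * at the workspace pointer, so a context switch saves and reloads three
 * values (WP, PC, status) instead of a 16 entry register file. */
typedef struct { uint16_t wp, pc, st; } Tms9900Ctx;

static uint16_t reg_read(const uint16_t ram[], const Tms9900Ctx *c, unsigned n) {
    return ram[c->wp / 2 + n];  /* register n is the 16 bit word at WP + 2n */
}

static void context_switch(Tms9900Ctx *cpu, const Tms9900Ctx *next) {
    *cpu = *next;               /* no register file to spill or reload */
}
```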
If we mix features of the RCA1802 and the TMS9900, we have a system which responds to interrupts within one instruction cycle. Specifically, if we take a TMS9900 interrupt priority encoder and use it to select an RCA1802 register as the program counter then nested interrupts occur seamlessly and without delay.
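A minimal sketch of the combined scheme, expressed as C for clarity; the hardware is entirely hypothetical and level 0 is taken as the highest priority:

```c
#include <stdint.h>

/* Hypothetical machine: one program counter per priority level. A
 * priority encoder picks the highest pending level and that level's
 * register simply *is* the program counter, so dispatch needs no
 * memory traffic and no saved state. */
typedef struct {
    uint16_t pc[16];   /* one program counter per interrupt priority */
    uint16_t pending;  /* bit n set: level n is requesting service   */
    unsigned level;    /* current priority level, 0 = highest        */
} Machine;

static void dispatch(Machine *m) {
    for (unsigned n = 0; n < m->level; n++) {
        if (m->pending & (1u << n)) {
            m->level = n;  /* fetch continues from pc[n] next cycle */
            return;
        }
    }
}
```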
To implement this on a contemporary processor, we would have 16 program counters. Where virtual memory is implemented, we probably want multiple memory maps and possibly a stack pointer within each memory map. However, if all program counters initialize to known addresses after reset, we do not require any read or write access to shadow program counters. Even if a processor is virtualizing itself, by the Popek and Goldberg virtualization requirements, it is possible to devise a scheme which never requires access to a shadow program counter.
In the worst case, we incur the initial overhead of pushing general registers onto the stack. (A sequence which may be nested as interrupt priority increases.) We also incur the overhead of restoring general registers. However, rather than loading alternate program counters with an address from a dedicated register or always indirecting from a known memory location, we unconditionally jump to one instruction before the interrupt handler and execute a (privileged) yield priority instruction. On a physical processor, this leaves the alternate program counter holding the most useful address. On a virtual processor, the privilege violation leads to inspection of the instruction and that indicates that the address should be stored elsewhere.
On the first interrupt of a given priority, we require a trampoline to the interrupt handler. But on all subsequent calls, we run the interrupt handler immediately. It typically starts by pushing registers onto a local or global stack, but the state which is specific to the interrupt system may be nothing more than a shadow program counter for each interrupt priority level and a global interrupt priority field. No instructions are required to read the internal state of the interrupt system. One special instruction is required to exit an interrupt and this is typical of many systems.
(This is the 48th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
Did the Chinese government invent an infinitely scalable processor switch fabric or are the figures from Chinese super-computers a complete fabrication? Indeed, if the Chinese super-computer switch fabric is so good, why does China continue to use Cisco switches for the Great Firewall Of China and why does China intend to use Cisco switches to connect all of its population?
The conventional super-computer design is to have 2^(3n) nodes in a three dimensional grid. For example, 4096 nodes in a 16×16×16 grid. Each node is connected to three uni-directional data loops. Ignoring network congestion, the round-trip time between any two nodes in a loop is constant. The round-trip time between any two nodes in a plane is constant. And the round-trip time between arbitrary nodes is constant. In this arrangement, nodes have direct access to each other's memory and it is fairly easy to implement a memory interface which provides equal access to three network links and a local node.
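A sketch of the node addressing implied by the 16×16×16 example; the bit packing is mine:

```c
/* Sketch: 16x16x16 torus addressing. Each axis is a uni-directional
 * loop, so a packet injected anywhere on a loop takes a fixed 16 hops
 * to return, which is why the round-trip times are constant. */
static int node_id(int x, int y, int z) {
    return (x & 15) | ((y & 15) << 4) | ((z & 15) << 8);
}

static int next_on_x_loop(int id) {
    return (id & ~15) | ((id + 1) & 15);  /* step one node along the x loop */
}
```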
The rate of throughput is obscene. An Intel Xeon processor may have up to 32 cores and each core has two-way hyper-threading. Each thread may consume up to five bytes of instruction per clock cycle and the clock is 4GHz. That's a peak instruction stream execution rate of 1280GB/s per node. For a 4096 node cluster, that's over 5PB/s. Memory addressing is also special. 12 or more bits of an address are required merely to specify the node number. With 4GB RAM per node, take 32 bit addressing and add another 12 bits for a 44 bit physically addressable space.
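The arithmetic, for anyone who wants to check it:

```c
#include <stdio.h>

/* Worked arithmetic for the figures above (peak, not sustained). */
int main(void) {
    double per_thread = 5.0 * 4e9;            /* 5 bytes x 4GHz           */
    double per_node   = per_thread * 32 * 2;  /* 32 cores, 2 threads each */
    double cluster    = per_node * 4096;      /* 4096 nodes               */
    printf("per node: %.0f GB/s\n", per_node / 1e9);  /* 1280 GB/s        */
    printf("cluster:  %.2f PB/s\n", cluster / 1e15);  /* ~5.24 PB/s       */
    /* Addressing: 4096 nodes needs 12 bits; 4GB per node needs 32 bits;
     * a flat physical address is therefore 44 bits. */
    return 0;
}
```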
This arrangement is highly symmetric, highly fragile and rapidly runs into scalability problems. And yet, China just adds a cabinet or two of processor nodes and always retains the world super-computer record. Or adds GPUs to a subset of nodes. (Where does the extra memory interface come from?) Is China using a partially connected, two local, six remote hypercube topology? Is there any known upper bound for this switch fabric which is at least five years old?
Assuming China's claims are true, it is possible to make a heterogeneous super-computer cluster with more than 16000 nodes and have at least 1Gb/s bandwidth per node without the MTBF becoming a problem. Even the cross-sectional area of the first level caches of 16000 MIPS processors creates a significant target for random bit errors. Likewise for ALUs, FPUs and GPUs.
I investigated redundancy for storage and processing. The results are disappointing because best practice is known but rarely followed. For storage, six parity nodes in a cluster is the preferred minimum. Four is rare and zero or one is the typical arrangement. For processing, anything beyond a back-check creates more problems than solutions. Best-of-three is great but it triples energy consumption and the rate of processor failure. Storage is mirrored at the most abstract level and that may be on different continents. At the very worst, it will be on separate magnetic platters accessed via separate micro-controllers. With processing, redundancy is on the same chip on the same board with the same power supply.
So, for processing, best practice is a twin processor back-check, like the mainframes from the 1970s. For a super-computer cluster, every node should participate in a global check-point and every computation should be mirrored on another node. Errors in parallel computation propagate extremely fast and therefore if any node finds an error, all nodes must wind back to a previous check-point. Checks can be performed entirely in software but it is also possible for a processor with two-way hyper-threading, with two sets of registers and also two ALUs and two FPUs, to run critical tasks in lock-step and throw an exception when results don't match.
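A minimal sketch of the twin back-check with checkpoint rollback; run_step and the Checkpoint layout are illustrative, not any real API:

```c
#include <string.h>

/* Sketch: execute one step twice (on a mirror node or a lock-stepped
 * thread in practice), compare, and wind back to the last global
 * checkpoint on mismatch. */
typedef struct { unsigned char state[4096]; } Checkpoint;

static void step(Checkpoint *live, const Checkpoint *saved,
                 long (*run_step)(Checkpoint *)) {
    Checkpoint mirror = *live;              /* mirrored computation        */
    long a = run_step(live);
    long b = run_step(&mirror);
    if (a != b)
        memcpy(live, saved, sizeof *live);  /* roll back to the checkpoint */
}
```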
Now that I've considered it in detail, it is apparent to me that Intel has never been keen on anything beyond two-way hyper-threading. I just assumed Intel was shipping dark threads to facilitate snooping. ("Hey! You've actually got eight-way hyper-threading but we use six to compress video of you fapping when you think your webcam is off.") But perhaps selected customers get a reliable mode and us plebs get the partial rejects without scalability.
(This is the 47th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
There have been numerous successful processor architectures over many decades but, after MIPS and Alpha fell by the wayside, I assumed we had reached the point where it was x86 versus ARM. However, recent events led me to re-evaluate this situation. I've found two additional contenders. The sale of (British) ARM to (Japanese) Softbank may have over-shadowed the subsequent sale of (British) Imagination Technologies to a Chinese sovereign fund. Imagination Technologies have the rights to legacy MIPS architectures but they're most noteworthy for losing a contract because Apple intends to develop its own GPUs rather than license successors to the PowerVR architecture. The drop in value made Imagination Technologies a very worthwhile acquisition for the Chinese government, especially when Chinese (and Russian) super-computers have often been based around MIPS.
However, a third, overlooked Californian architecture could eclipse everything. We've often seen how awful, low-end architectures grow in capability, consume mid-range systems and, more recently, are clustered until they fulfill top-tier requirements. Well, that could be the Xtensa architecture. It is a 16 register, strict two-read, one-write, three-address machine with licensing for 512 bit implementation and numerous optional instructions. Unlike MIPS, ARM or, in particular, x86, Xtensa instruction decode is ridiculously simple while maintaining good code density. Specifically, the opcode determines if the instruction is two bytes or *three* bytes with no exceptions. Potentially, this can be determined from one (or maybe two) bits of the opcode. Presumably, this requires break-point instructions of two lengths. However, three byte instructions invariably exceed the code density of ARM and MIPS while having significantly simpler instruction decode.
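A sketch of why that matters for the fetch unit; the bit tested below is illustrative rather than copied from the Xtensa ISA manual:

```c
#include <stdint.h>

/* When instruction length is a fixed function of the first byte, the
 * fetch unit can find instruction boundaries with a single mask and no
 * instruction-dependent state. */
static unsigned insn_length(uint8_t first_byte) {
    return (first_byte & 0x08) ? 2u : 3u;  /* illustrative bit choice */
}
```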
At present, the Xtensa architecture is typically surrounded by an accretion of shit but the core CPU is sound. Xtensa processors are most commonly packaged by Espressif Systems in the ESP8266 which is then packaged in various tacky micro-controller boards with, for example, wireless networking of various protocols. Firmware is typically held in an external serial ROM and is, presumably, paged using base firmware which maintains an LRU cache. Handling of interrupts via paging is extremely slow and this has led to the ESP8266 processor being unreliable and a certain method to achieve network time-out. Regardless, this is packaged into IoT devices where making a network connection is a small miracle and security is severely neglected. But don't worry because there are plenty of consultancies who upsell services to cover these deficiencies!
If you dump everything from the serial ROM outwards, there's an architecture which can out-compete the world's fastest super-computer. I can only approach this efficiency by describing multiple iterations of vaporware. I begin with a few general observations. Firstly, every instruction set is more than 2/3 full from the first iteration. An exception to this rule is the MOS Technology 6502 instruction set, which was designed specifically to be an almost pin-compatible, smaller die, higher yield clone of Motorola's 6800 processor. (I would have been *extremely* unamused if I worked at Motorola during this period.) Secondly, it has been known for decades that approximately 1/2 of processor instructions are register moves and approximately 1/3 of processor instructions are conditional operations. Thirdly, it would be logical to assume that code density can be maximized by allocating instruction space in proportion to instruction usage.
An example of these observations is the Z80 instruction set. Ignoring prefix opcodes and alternate sets of registers, there are eight operand encodings. Eight bit opcodes with the top two bits set to 01 encode a move instruction. The remaining six bits of the opcode indicate a three bit destination field and a three bit source field. So, 01-011-100 specifies a move between registers (LD E,H) and 1/4 of the instruction space is allocated to these operations. That's fairly optimal for various definitions of optimal. However, one of the eight encodings is the memory operand (HL) rather than a register, a move to self is effectively a NOP, and the would-be memory-to-memory move 01-110-110 is repurposed as HALT. So, only 56 of the 64 encodings perform useful moves. And the use of alternates and escapes greatly increases opcode size for the remainder of the permitted moves. The Motorola 68000 architecture doubles the opcode to 16 bits, extends each operand field with a three bit addressing mode and allows memory addressing modes as a source and/or destination but otherwise works in a similar manner. So, 0011-011-000:000-100 performs a similar move between 68000 data registers (MOVE.W D4,D3) but with lower code density. From this, I wondered if it was possible to define an instruction set where the majority of the instruction set was prefix operators - even to the extent that all ALU operations would require a prefix. My interest in this architecture was renewed when Intel devised the x86 VEX prefix. In addition to providing extensible SIMD, it also upgraded x86 from a two-address machine to a three-address machine.
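The Z80 case is compact enough to show in a couple of lines of C:

```c
#include <stdint.h>

/* Recognise the Z80's single byte register-to-register moves: opcodes
 * 0x40..0x7F are LD r,r' except 0x76, which would be LD (HL),(HL) and
 * is repurposed as HALT. */
static int is_z80_ld_rr(uint8_t op) {
    return (op & 0xC0) == 0x40 && op != 0x76;
}
```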
So, it is certainly possible, and even economically viable, to make a micro-processor where the default instruction is a two-address move, one prefix allows a larger range of registers to be addressed, another prefix specifies an ALU operation and another prefix upscales all two-address instructions to three-address instructions. (Have we consumed the instruction space? Probably and then some.) However, there is something really inefficient about this arrangement. Prefixes can be specified in any order. The trivial implementation consumes one clock cycle per prefix while accumulating internal state. However, there is a more fundamental problem related to permutations. Two prefixes have two permutations and therefore occupy two places in the instruction space. Three prefixes have six permutations and therefore occupy six places in the instruction space. Four prefixes have 24 permutations and therefore occupy 24 places in the instruction space. For N interchangeable prefixes, the N! orderings all encode the same operation, so we lose approximately log2(N!) bits of instruction space through duplication. That's grossly inefficient.
I moved away from this idea and worked on quad-tree video codecs. From this, I found that it was possible to implement instructions where two or more fields specify operands and none of the fields specify an opcode. An example would be a 16 bit instruction divided into four equal size fields. If the top bit of a field is set then the field represents opcode (instruction). If the top bit of a field is clear then the field represents operand (register reference). From this, we have 16 sets of instructions where a large number of CISC instructions apply to zero, one or two registers and we have a lesser number of instructions which apply to three registers. We also have one implicit instruction which applies to four registers. We can tweak the ratio of registers to instructions; 12, 13 or 14 general purpose registers may be preferable. I continued working on this idea and found that the fields can be divided strictly into inputs and outputs. It is possible to make a comprehensive design which permits 16 bit registers, 32 bit registers or 64 bit registers and has no prefix instructions or internal state between instructions. Donald Knuth's designs and the Itanium architecture show that it is possible to discard a dedicated flag register. I also found that it was trivial to consider a super-scalar design because the outputs of one instruction are trivial to compare against the inputs of the next instruction. However, code density is terrible, especially when handling immediate values. At best, I could achieve six bits of constant per 16 bit instruction - and even this stretches the definition of no internal state.
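A sketch of the decode path for the base arrangement; the field packing is mine:

```c
#include <stdint.h>

/* Sketch of the four-field scheme: a 16 bit instruction split into four
 * 4 bit fields, where the top bit of each field selects opcode (set) or
 * register operand (clear). The packing is illustrative. */
typedef struct { uint8_t field[4]; } Insn;

static Insn decode(uint16_t raw) {
    Insn i;
    for (int n = 0; n < 4; n++)
        i.field[n] = (raw >> (12 - 4 * n)) & 0x0F;
    return i;
}

static int is_operand(uint8_t f) {
    return (f & 0x08) == 0;  /* clear top bit: one of eight register refs */
}
```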
However, further work with codecs and networking found that it is possible to represent instructions of unlimited size using BER or similar. Effectively, the top bit of an eight bit field specifies if a further byte follows. This can be daisy-chained over a potentially unlimited number of bytes. From here, we view the instruction space as a binary fraction of arbitrary precision. (This works like many of the crypto-currencies where, potentially, coin ownership can be divided into arbitrarily small binary fractions.) A proportion of the instruction space may be privileged in a manner which allows a break-point instruction of arbitrary length. We may also have a fence operation which works in a similar manner. We may have addressing modes of arbitrary complexity. However, multiple reads and/or writes per opcode is very likely to be detrimental when used in conjunction with virtual memory. We also retain much of the prefix and super-scalar stuff.
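A minimal decode sketch of this length scheme:

```c
#include <stddef.h>
#include <stdint.h>

/* BER-style length scheme: the top bit of each byte says whether a
 * further byte follows, giving seven payload bits per byte and opcodes
 * which form a prefix-free binary fraction of arbitrary precision. */
static size_t insn_bytes(const uint8_t *p, uint64_t *opcode_out) {
    uint64_t acc = 0;
    size_t n = 0;
    do {
        acc = (acc << 7) | (p[n] & 0x7F);  /* seven payload bits */
    } while (p[n++] & 0x80);               /* continuation bit   */
    *opcode_out = acc;
    return n;
}
```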
We only get seven bits of instruction per byte and this significantly reduces code density. I considered disallowing one byte instructions and this may be necessary to compete with the Xtensa architecture. In either case, 1/4 of the instruction space provides move operations between general purpose registers. Other move operations may raise the proportion of move instructions closer to the optimal 1/3. For super-scalar implementation, one unit handles trap instructions, fence instructions and branches. This ensures that all of these operations are handled sequentially. Secondary units are able to handle all ALU and FPU operations. Tertiary units have limited functionality but it is conceivable that eight FPU multiplies may be in progress without having a vector pipeline or out-of-order execution and reconciliation.
If I had unlimited resources, I would investigate this thoroughly. If I immediately required an arbitrary design then I would strongly consider the Xtensa architecture. If I immediately required a chip on a board and didn't care about memory integrity then I'd consider an Arduino or Raspberry Pi. If a faster processor or ECC RAM is required then x86 remains the default choice. However, I'd be a fool to ignore ARM, MIPS, Xtensa or future possibilities.
(This is the 46th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
Before PDAs [Personal Digital Assistants] and smartphones, people had paper diaries. Companies like Filofax made a fortune by selling modular stationery. Numerous companies made compatible stationery and it was possible to make your own stationery which fitted in Filofax's miniature ring-binders. However, many people preferred to buy the official branded products. Yes, before people boasted that they could animate a poo emoji with facial tracking, they would boast about carrying official Filofax graph paper. (Given the ornate designs of Bronze Age daggers, I suspect this type of boasting goes back many millennia.)
I believe that there is a market for a modular bag format in which any party is free to produce interoperable clones. Bag segments are self-contained and nominally cylindrical. All use one yard, YKK, metal tooth zippers. A bag segment may or may not have handles. One or more bag segments may or may not be used with end caps. An end cap may have functionality such as being a detachable toiletry bag.
Bag segments may include a duffel bag, rucksack straps, a cool bag, a vinyl LP bag or a handbag. A handbag is one or more stubby cylindrical segments with end caps providing carry handles and/or a shoulder strap. However, to be interoperable with the other segments, a modular handbag always has circular end caps with a circumference of one yard.
If you wish to make modular bags then please have one small hole in each of the internal circular panels. This allows USB and/or the cell networking protocol to be threaded through any number of bag segments. Also, please place the hole near the seam of the zipper to minimize problems due to inadvertent ingress of water.
Common cabling, signalling and power allows a modular bag to connect to my ideal car and that can be connected to my ideal house using a protocol which does not tunnel outside of the PAN or LAN.
(This is the 45th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
The budget option for the ideal watch is the Casio F-91W terrorist watch with an additional mode for two factor authentication. It is possible to replace the F-91W motherboard with a custom design. This provides a standard strap, case, battery compartment, splash-proof buttons and screen. I doubt that a custom F-91W would be noticed in typical airport security theater. Regardless, make sure you get the Casio F-91W with blue trim to ensure special airport screening.
(This is the 44th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
The color of my ideal phone should be off-white to match my ideal laptop, tablet and watch. Screen aspect ratio should be 1:1 or 5:4 to match my ideal laptop and watch. Should be clamshell design to match laptop and tablet. Phone should have address book, clock, alarm, phone and walkie-talkie function. No camera. No SMS. No games. No apps and certainly no facial recognition or facial tracking. Off means off. Phone may tether to laptop or tablet over USB. Cell networking protocol does not extend to phone.
(This is the 43rd of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
What's wrong with a laptop, you ponces? If you must have a tablet, I suggest a variant of the OLPC dual-screen touch-screen clamshell design but using e-ink on both screens. It could be used like a book or a clamshell laptop but it would be completely incapable of playing video. For that functionality, you are referred to the ideal laptop. However, tablet battery should exceed two weeks on stand-by and eight hours of heavy, continuous use. Tablet does not have any radio interfaces but can tether to ideal phone over USB.
(This is the 42nd of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
My ideal laptop is something similar to an old Apple PowerBook or IBM ThinkPad and possibly with a similar level of processing power. It would be an off-white, clamshell design with a 19 column, full travel, straight run keyboard and a centered track-pad with three buttons using micro-switches which are easy to replace.
The laptop would have six bays, each of which may hold either a battery or a hard disk. (Sorry, no SSD yet due to concern about long-term storage - or even medium-term storage in hot conditions.)
It is very important to have a power connector which can be inserted in any orientation, unlike many of the numerous, incompatible Apple power connectors. Actually, it may be worthwhile to use Dell or IBM ThinkPad power connectors and power supplies. I presume Apple's clamshell case magnetic catch patent expires soon on the basis that it has been dropped from Apple products. My ideal laptop should have four USB connectors on the *side* of the laptop - not the rear. It should have one genuine serial port, not tunneled over USB. No radio interfaces.
For my ideal laptop, I'd strongly consider my own vaporware processor and operating system. However, it must have ECC RAM. For a halfway practical implementation, there is enough space between bays for a Raspberry Pi. This is the least hateful option because motherboards would be available via retailers and would be completely replaceable in the field. To make field servicing as easy as possible, the case should use screws of one length and one width, accessible with both flat and Phillips screwdrivers.
Ideal laptop should be off-white to match ideal tablet, phone and watch. Screen aspect ratio should be 1:1 or possibly 5:4 if VGA compatibility is required.
(This is the 41st of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
This is not a troll to tell people to use the command line. When my focus was making better implementations of digital paper, my ideal GUI was a data-centric system where digital lined paper was for writing, digital plain paper was for drawing and digital graph paper was for diagrams and calculations. Sheets could be freely pasted and collated. However, this arrangement is completely incompatible with structured databases, streaming video and process control.
So, as a matter of practicality, my ideal GUI is a mix of twm and Windows 95 with extensive tweaks. The most significant change is that all window widgets are at the bottom of a window. This makes the interface agnostic to pointer devices and touch-screens. The typical arrangement of window titles makes explanation of a GUI easier when enumerating from top to bottom. However, for daily use, window captions are more practical than window titles. Likewise, menu lists should spring from the bottom of a window. I am particularly annoyed by the inconsistency of tabs in user interfaces. For example, mIRC has tabs along the bottom whereas Firefox has tabs along the top. And if you use the spyware masquerading as the Chrome web browser, it puts the tabs in the window title bar.
Next, the clipboard should be a stack rather than a singleton. Some versions of Microsoft Word have nine separate clipboards but this doesn't work outside of the application. A stack of clippings potentially allows unlimited items while legacy applications see the top item of the stack. This allows a user to copy multiple items, switch application and then paste in the reverse order. Clipboard functionality may include a firewall. A firewall generally indicates a design failure around a trust boundary. However, a clipboard firewall should prevent accidental pasting between corporate and personal applications.
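A minimal sketch of the stack clipboard; the function names are illustrative and not from any real windowing API:

```c
#include <stddef.h>

/* Clipboard as a stack. Legacy applications call only clip_top and see
 * a conventional single clipboard; stack-aware applications push on
 * copy and pop on paste. */
#define CLIP_MAX 64

static const char *clips[CLIP_MAX];
static size_t clip_count;

void clip_copy(const char *item) {            /* copy pushes        */
    if (clip_count < CLIP_MAX)
        clips[clip_count++] = item;
}

const char *clip_top(void) {                  /* legacy paste: peek */
    return clip_count ? clips[clip_count - 1] : NULL;
}

const char *clip_pop(void) {                  /* stack-aware paste  */
    return clip_count ? clips[--clip_count] : NULL;
}
```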
HDR video with synchronized sound is the default window type. Remoting to other hosts is shown by differing decoration. For example, corporate colors for one window and pink for personal stuff. Restrictions may prevent clipboard pasting to and from corporate applications.
Multiple video planes allow QNX Photon functionality like tint, blur, fog, ripple and displace without leaking data between applications or adding event latency over the network. Furthermore, this abstraction allows one window to appear on multiple desktops. It is possible to share a window among multiple users, possibly with time limits and possibly with read-only access. With device pairing, it is possible for a flick gesture to move or copy a window reference from phone to wall-mounted screen.
I've previously mentioned that it is possible to implement a text console within HDR video; possible to get HDR video through UDP; possible to get UDP through 24 byte cell networking; possible to get 24 byte cells through network nodes with 2KB RAM; and possible for network nodes with 96KB RAM or less to work with USB and/or VGA. Although data passing through micro-controllers at kilobit speed may be agonizingly slow, it is possible to have a fancy full-screen video system which is compatible with a trustworthy section providing industrial control, hydroponics, wine brewing or beer brewing (with alcohol yeast and/or opiate yeast). Between writing letters to grandma and watching films on the home entertainment system, it is possible to check the garden, wine and beer. Potentially, this could be one or more menu items in Kodi's menu hierarchy. However, this arrangement would compromise security.
(This is tangentially related to other topics and is here for completeness.)
A wise man said to always know the location of your towel. It is next most important to know the location of your plushie. Many people fail in this regard. In addition to being a source of extra warmth, a fluffy pillow with a face is an extra pillow. Regardless, there is something about staring into a little face which is oddly effective for certain stretching exercises. (This is relatively safe and sane compared to people who incorporate a baby into their exercise routine.)
Anyhow, my ideal plushie is about 12 inches (30cm) tall with stubby, chibi arms and legs. It is possible to design material templates for a common base plushie and then make variants for bear, rabbit, cat, dog, fox, raccoon, badger, lion and suchlike. For example, bear and rabbit have a common short tail. Ears would either be long rabbit ears, short pointy ears or short round ears. All have a common body and paws and a white under-belly. Given the common head size, Build-A-Bear headgear may fit but clothing may not fit.