☠️RADICAL ISLAMIC TERRORIST from #Uzbekistan kills 8 in NYC!

Posted by realDonaldTrump on Wednesday November 01 2017, @07:37AM (#2738)
15 Comments

☠️🚴☠️🚶💥🚚 ISIS terrorist in truck rammed people walking & bicycling in Lower Manhattan. Injured 12 & killed 8. 😈💥🔫👮 Ryan Nash of NYPD shot bad guy. 😈🔒👮 Terrorist in custody. ✈ Came from #Uzbekistan on a green card! 🏥🚑👮🙉👂 Officer Nash in hospital for ringing in ears. 🔇💥🔫✔ Unfair tax on suppressors! PIX11News/status/925444943912620032 pic.twitter.com/F0TexToiXd pic.twitter.com/nWhrxlqKFY #ManhattanAttack #NYCStrong

Fox News Employees Know Their Mueller/Russia Coverage Stinks

Posted by takyon on Wednesday November 01 2017, @01:56AM (#2737)
12 Comments

Video Codec: Run Length Encoding

Posted by cafebabe on Tuesday October 31 2017, @04:01PM (#2735)
4 Comments
Software

(This is the 53rd of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

I've been working on a video encoder and decoder which divides frames of video into square tiles, sub-divides those tiles, and applies a specified short-list of matching and compression algorithms to tiles of 256×256 pixels or significantly smaller. Want real-time conferencing? Stick to MJPEG tile-types. Want lossless video? Avoid affine tile-types. Want maximum resilience? Minimize use of key-frames. Want best compression? Use everything.

In addition to viewing real-time or pre-recorded video, this arrangement can be used to display contemporary or legacy software which is running elsewhere. Indeed, it is likely to be used extensively for remoting LibreOffice Writer, Calc and Draw.

So, the video codec is likely to be used in scenarios where the majority of a window is plain white and the exceptions have a hard edge. I presumed that the texture compression algorithms whose patents have just expired would handle this with ease. However, an ex-housemate suggested inclusion of the simplest historical method. Well, I'm not a complete misanthropic curmudgeon and I do listen to my friends, so I implemented a Run Length Encoding tile-type. Although I wish that I hadn't.

Run length encoding schemes differ significantly and, in the case of the Amiga's IFF picture compression, may be defined ambiguously. A typical byte encoding would allow up to 128 repetitions of a following value or up to 128 literal values to follow. However, the cut could be defined anywhere. So, it could define 240 repetitions and 16 literal values. Or the opposite. There's no point defining both one repetition and one literal because they encode the same thing. Likewise for zero. So, definitions may be shifted by one or two values.

Should Run Length Encoding start afresh on each line? Implementations vary. The next problem is more pernicious. What is a repetition? If there is a sequence of pixels and two adjacent pixels are the same, should an encoder cease encoding literals, encode a repetition of two values and then continue encoding literals? For 24 bit color, yes. For 8 bit monochrome, no. For 16 bit data, it depends. So, even a dumb Run Length Encoder has to look ahead by one pixel. In some circumstances, a smart Run Length Encoder may have to look further ahead.
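For concreteness, here is a minimal C sketch of one such byte-wise convention (my own choice of cut, not the codec's actual tile-type): control bytes 0 to 127 introduce 1 to 128 literals, control bytes 128 to 255 introduce 2 to 129 repetitions, and the one-value look-ahead decides when to break the literal stream.

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: encode n bytes from in[] to out[]; returns encoded length.
       Control 0..127 means "N+1 literals follow"; 128..255 means
       "repeat the next byte N-126 times". */
    size_t rle_encode(const uint8_t *in, size_t n, uint8_t *out)
    {
        size_t o = 0, i = 0;
        while (i < n) {
            size_t run = 1;
            while (i + run < n && in[i + run] == in[i] && run < 129)
                run++;
            if (run >= 2) {                      /* break for any run of two;
                                                    debatable for 8 bit data */
                out[o++] = (uint8_t)(126 + run); /* 2..129 reps -> 128..255 */
                out[o++] = in[i];
                i += run;
            } else {
                size_t lit = 1;                  /* extend literals until the
                                                    look-ahead sees a run */
                while (i + lit < n && lit < 128 &&
                       !(i + lit + 1 < n && in[i + lit] == in[i + lit + 1]))
                    lit++;
                out[o++] = (uint8_t)(lit - 1);   /* 1..128 lits -> 0..127 */
                for (size_t k = 0; k < lit; k++)
                    out[o++] = in[i + k];
                i += lit;
            }
        }
        return o;
    }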

Stuff like this caused me to implement an assertion system inside my buffer structures. Specifically, there is a compilation directive which enables a shadow buffer with a type system. Arguably, it should be enabled by default. However, with the intended usage of 3840×2160 pixel video divided into 64×64 pixel tiles, and each of the 2040 tiles requiring multiple 4KB buffers, data-types on buffers would require a large amount of extra memory.
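My implementation is not public, but a sketch of the shape I have in mind for such a directive follows; all names are invented and the tag granularity is simplified to one tag per byte:

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define TILE_BYTES 4096

    /* Hypothetical shadow buffer: when BUFFER_TYPECHECK is defined, every
       byte carries a type tag and mismatched reads trip an assertion. */
    struct buffer {
        uint8_t data[TILE_BYTES];
    #ifdef BUFFER_TYPECHECK
        uint8_t tag[TILE_BYTES];
    #endif
    };

    static inline void put(struct buffer *b, size_t i, uint8_t v, uint8_t type)
    {
    #ifdef BUFFER_TYPECHECK
        b->tag[i] = type;                /* record the type of each write */
    #endif
        b->data[i] = v;
    }

    static inline uint8_t get(const struct buffer *b, size_t i, uint8_t type)
    {
    #ifdef BUFFER_TYPECHECK
        assert(b->tag[i] == type);       /* wrong-type read fails loudly */
    #else
        (void)type;
    #endif
        return b->data[i];
    }

The memory cost is visible directly: each 4KB buffer becomes 8KB when the directive is enabled.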

However, I've yet to get to the best part. I remember exactly why I didn't implement a Run Length Encoding tile-type. The RMS [Root Mean Square] error (or smarter variant) for Run Length Encoding is always zero. Therefore, when I include Run Length Encoding, the encoder invariably chooses Run Length Encoding for every part of a video frame. Even if the choice metric is set to error divided by encode length, the result remains zero.

Run Length Encoding greatly improves quality but it also greatly increases encoding size. Attempts to restrict matches have had mixed results. I've tried setting a minimum tile size and a maximum number of tokens per tile. However, it is easier to exclude it from the encoding process entirely. This experience has made me slightly more of a misanthropic curmudgeon and I'm less inclined to take advice from people who know very little about codecs.

Bit Matrix Transpose, Part 2

Posted by cafebabe on Tuesday October 31 2017, @03:48PM (#2734)
3 Comments
Code

(This is the 52nd of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

I posted example code to perform an 8×8 bit matrix transpose. This is intended for multiple applications including driving multiple serial lines for a network protocol, driving multiple serial lines for I2C, driving multiple serial DACs, and bit-plane audio and video compression. For this reason, I want something which is fast, flexible and reliable. I don't want to keep coming back to it. I certainly don't want to break or fork it. Although the example code is supposed to be optimized for eight bit, 16 bit, 32 bit and 64 bit processors, disassembly of 32 bit ARM GCC output found that the results were dismal. I am also disappointed by the amount of effort required to correct this deficiency.
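For reference, a C sketch of the well-known technique (essentially the Hacker's Delight routine, not my posted example code): pack the eight rows into a 64 bit word, then swap progressively larger sub-blocks about the diagonal.

    #include <stdint.h>

    /* Transpose an 8x8 bit matrix stored as one byte per row. */
    void transpose8(const uint8_t in[8], uint8_t out[8])
    {
        uint64_t x = 0;
        for (int i = 0; i < 8; i++)           /* pack rows into x */
            x = (x << 8) | in[i];

        /* Swap 1x1, 2x2 then 4x4 sub-blocks about the diagonal. */
        x = (x & 0xAA55AA55AA55AA55ULL)
          | ((x & 0x00AA00AA00AA00AAULL) << 7)
          | ((x >> 7) & 0x00AA00AA00AA00AAULL);
        x = (x & 0xCCCC3333CCCC3333ULL)
          | ((x & 0x0000CCCC0000CCCCULL) << 14)
          | ((x >> 14) & 0x0000CCCC0000CCCCULL);
        x = (x & 0xF0F0F0F00F0F0F0FULL)
          | ((x & 0x00000000F0F0F0F0ULL) << 28)
          | ((x >> 28) & 0x00000000F0F0F0F0ULL);

        for (int i = 7; i >= 0; i--) {        /* unpack back to bytes */
            out[i] = (uint8_t)x;
            x >>= 8;
        }
    }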

I'm using a failing x86 laptop as a dumb terminal for a Raspberry Pi which is off-line and has a fixed configuration for the duration of the development cycle. Therefore, my first use of objdump -d to disassemble GCC output was applied to 32 bit ARM code on a Raspberry Pi. And the result is horrible. There are hidden functions prior to main(). That is known. But do they have to be trampoline functions? Is this related to ARM Thumb mode? Who knows? The compiled code is also quite laborious with stacks and stack frames. With hindsight, this should have been the focus of my optimization. I'm aware of inlining and a large number of other compiler optimization techniques. However, with more streamlined stack conventions, there would be less code and less overhead to amortize when performing many of these optimizations. I suppose this is part of the "Why the blazes did the compiler do that???" response to optimization.

Anyhow, I'm optimizing code size on the basis that I should have unrolled loops and every instruction takes one clock cycle to execute. Therefore, smaller code means faster execution and less energy consumption. There are numerous cases where smaller code is contrary to faster execution. There are also numerous cases where instruction count is not linear with execution time. However, on this architecture, for this subroutine, it is possible to equate length with speed as a first approximation.

It helps if a desirable result is known. On 32 bit ARM, there is a four bit conditional execution field in every instruction. This allows ARM implementations to have a very simple execution unit and avoids instruction decode pipeline stalls when jumping over one or two instructions. It is faster and easier to take the hit of one wasted instruction slot. Unfortunately, the consequence of 32 bit instructions with a four bit conditional field is that code density is terrible.

Anyhow, one would assume that 32 bit ARM has a register bit test instruction or a method of performing an AND mask operation with a short, immediate value. Being a three-address machine, it would probably have a method of doing this non-destructively. Although, with a four bit conditional field and a four bit pre-scale field, short, immediate values may be mutually exclusive with a third register reference which would be required for a non-destructive test.

Even if the compiler's input was the dumb technique for an 8×8 bit transpose, the output should be a sequence of 128, or maybe 192, instructions plus some instructions to clear registers, marshal data into two of the eight 32 bit registers and all of that stack frame nonsense. So, what did the "optimized" code produce? About 262 instructions. At this point, I felt like the comedian, Doug Stanhope, when he looks in the mirror and says "Well, that ain't right."

I (thankfully) don't read ARM assembly but a large number of stack references and square brackets indicated that GCC documentation claims about register allocation are complete bunkum. Again, it helps if desirable operation is known. With 64 bits of input packed into two 32 bit registers, the same for output, plus one or two working registers, the majority of stack references would be avoided if five or six general purpose registers were available. However, eight char pointers for input and eight char pointers for output, provided as function parameters, had taken precedence over the more frequent operations.

I really wanted to keep the input pointers because I wanted to use the 8×8 bit transpose as a building block for a 12×8 transpose or suchlike for 12 bit DACs. After a re-write (times four, due to premature optimization), output performed post-increment on one pointer. This improved code size but wasn't sufficient. Reluctantly, input is also obtained via post-increment of one pointer. This frees sufficient general purpose registers and the compiled code, minus stack overhead, is about 80 instructions.

Unfortunately, I was less than halfway through this problem. After getting sensible output on 32 bit ARM, I repeated tests on 16 bit Thumb instructions, as used in an Arduino Due, and 16 bit AVR instructions, as used on the majority of Arduino micro-controllers. After about three days, I had something which compiled reliably on x86, 32 bit ARM instructions, 16 bit Thumb instructions and AVR. I have no reason to believe that it performs badly on 6502, Z80, 68000, PowerPC, MIPS or any mainstream architecture. However, I'm astounded that it took so long. I got to the point where the default command was functionally equivalent to:-

objdump -d a.out | wc -l

with the occasional back-check to make sure that I wasn't inadvertently optimizing an unrolled loop. (Yes, many hours of fun there.)

After understanding the dumbness of a compiler in this particular process, I devised a method to implement the partial 8×8 bit transpose functions in a fairly efficient manner. Indeed, this can be reduced to a macro which defines a C function. The function has local byte buffers for input and output. Conceptually, the input buffer is cleared and then data is selectively copied into the local input buffer. A call is made to an 8×8 bit transpose and then data is selectively copied from the local output buffer. The compiler is able to unroll the known quantity of clear and copy operations to virtual slots on the data stack. It is also able to eliminate redundant clear operations. Most impressively, it is able to eliminate all operations involving dummy inputs and outputs. The partial bit transpose functions are proportionately smaller than the complete 8×8 bit transpose. Unfortunately, compilation is relatively slow and this worsens because it is trivial to define many variants of bit transpose function.
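A hedged reconstruction of the pattern (names are mine; it leans on the transpose8() sketch above): the local buffers are the data trampoline and the constant copy counts are what the optimizer dissolves.

    #include <stdint.h>

    /* From the sketch above. */
    void transpose8(const uint8_t in[8], uint8_t out[8]);

    /* Defines a partial transpose with only the first `rows` input rows
       live; the cleared slots are dummy inputs which the compiler can
       eliminate along with every operation that feeds them. */
    #define DEF_PARTIAL_TRANSPOSE(name, rows)                    \
    static void name(const uint8_t *in, uint8_t *out)            \
    {                                                            \
        uint8_t tin[8] = {0};   /* cleared local input buffer */ \
        uint8_t tout[8];        /* local output buffer        */ \
        for (int i = 0; i < (rows); i++)                         \
            tin[i] = in[i];     /* selective copy in          */ \
        transpose8(tin, tout);                                   \
        for (int i = 0; i < 8; i++)                              \
            out[i] = tout[i];   /* copy out                   */ \
    }

    DEF_PARTIAL_TRANSPOSE(transpose4of8, 4)  /* four live input rows */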

So, it is possible to optimize bit transpose functions by minimizing parameter count and by creating data trampolines which are optimized away by a compiler.

Unfortunately, I had hoped that a full 8×8 bit transpose for ARM Thumb mode would compile to 40 instructions with minimal escapes to longer instructions. The result was 80 instructions, mostly of the extended format. This is disappointing. At a minimum, this consumes twice the intended processing budget and that assumes there is no execution penalty for mis-aligned instructions. However, I'm targeting a device which has twice the requested processing power. So, I've soaked the available resources and possibly eaten an extra 40% into the processing budget. It may remain desirable to write assembly implementations for one or more architectures. However, I've found a reliable but tedious method to avert this situation in some cases.

My Ideal Database, Part 1

Posted by cafebabe on Tuesday October 31 2017, @03:40PM (#2733)
0 Comments
Software

(This is the 51st of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

I've described an outline of my ideal filing system. My ideal database allows generalized use of the full-text search facility. This requires an outline of proposed volume storage. This borrows widely from ReiserFS, ZFS and MySQL Server storage engines such as InnoDB, Nitro and NDB Cluster.

Media is striped into 512MB stripes and each unit has a map of stripe allocation types where one type is no allocation and another type is bad stripe. It is envisioned that six out of 67 stripes perform parity functions and this is rotated in a procession across units. Each stripe has a bad sector map. For 1KB sectors, this requires 64KB. For 4KB sectors, this requires 16KB. If these sectors are bad, the stripe cannot be used. If the stripe map is bad, the unit cannot be used.

The remainder of sectors within a stripe are available to an application. However, applications may not be expecting raw, disjoint storage and therefore a standard contiguous mapping with redundancy may be utilized. The full-text search for the filing system utilizes a specialized database in which search terms are striped by length with attributes such as capitalization and accents. Conceptually, "House++" would be stored as HOUSE-15-10000-00000 where digits represent punctuation, capitalization and accented characters. Sequential entries would be compressed into fragments occupying 13 or 21 sequential sectors and would be split or reconciled as required.
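My reading of that normalization, as a hedged C sketch (the exact field layout, including the "15" above, is not fully specified, so the output format here is illustrative only):

    #include <ctype.h>
    #include <stdio.h>

    /* Split a token into upper-cased base letters plus a capitalization
       bitmap, counting punctuation separately. */
    static void normalize(const char *token)
    {
        char base[64], caps[64];
        int n = 0, punct = 0;

        for (const char *p = token; *p && n < 63; p++) {
            if (isalpha((unsigned char)*p)) {
                caps[n] = isupper((unsigned char)*p) ? '1' : '0';
                base[n++] = (char)toupper((unsigned char)*p);
            } else {
                punct++;
            }
        }
        base[n] = caps[n] = '\0';
        printf("%s-%s-punct:%d\n", base, caps, punct);
    }

    int main(void)
    {
        normalize("House++");   /* prints HOUSE-10000-punct:2 */
        return 0;
    }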

The general database storage requires three or more 512MB stripes per table. One or more stripes hold 13 sector fragments. One or more stripes hold 21 sector fragments. One or more stripes hold the index of fragments. All rows within a table are stored in N-dimensional Peano curve format and therefore the table is its own universal index. Sets of eight rows are bit transposed, Peano mixed and arithmetic compressed into one stream. If a fragment exceeds 13 sectors, it is placed into a 21 sector fragment. If a fragment exceeds 21 sectors, it is placed into two 13 sector fragments. All CHAR and VARCHAR fields which are longer than 13 bytes are stored in shadow tables which require their own 512MB stripes. Each definition of VARCHAR(255) requires several gigabytes of space. BLOB fields are stored in an unindexed Fibonacci filing system.

If you doubt the wisdom of chopping data so finely then please investigate the sort utility used in a previous project.

My Ideal Filing System, Part 1

Posted by cafebabe on Tuesday October 31 2017, @03:35PM (#2732)
0 Comments
Software

(This is the 50th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

I have an idea which is based upon Google Search Cache and EMC storage units. If it is possible to de-duplicate files on a global scale then it may be possible to obtain a global scale search engine as a corollary. Unfortunately, my developments in this field have been amusingly dire. I've been unable to implement a de-duplicated filing system with write throughput exceeding 11KB/s. However, said filing system did not store a JPEG quantize table more than once. Nor did it store two copies of the same file in a PKZip archive.

A considerably simplified version may incur read fragmentation but would have vastly superior write and replication throughput. Even a trivial test system would scale beyond the affordability of many organizations. There are various techniques for filing system de-duplication. Some of these techniques even apply to memory de-duplication used in virtual desktop or virtual server systems.

The simplest technique is to not store any file with a duplicate checksum. Some legal and medical systems use this technique because it has the advantage of providing some tamper-proofing. If a file is modified then its checksum changes and therefore it is not the original file. Dropbox and MegaUpload used a similar technique and this explains the inability to maintain privileges within these systems. If you can obtain the checksum of a file by any means then revocation of access is contrary to HTTP caching.

Block-level de-duplication works on fixed boundaries; typically 2^n for suitably large n. Linux Volume Management defaults to 4MB blocks. Anything above 256MB is recommended to maintain video or database throughput. However, I discovered a weird result in which block size is immaterial. If block size is very large then it is indistinguishable from file de-duplication. If block size is very small then fragmentation may adversely affect read or write speed. However, in the general case, it does not matter.
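As a toy illustration of the block-level technique (entirely my own sketch; FNV-1a stands in for a proper cryptographic digest, collisions are ignored and the linear scan is O(n^2)):

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK 4096

    static uint64_t fnv1a(const uint8_t *p, size_t n)
    {
        uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
        while (n--) {
            h ^= *p++;
            h *= 0x100000001b3ULL;            /* FNV prime        */
        }
        return h;
    }

    /* Count distinct blocks; `seen` must have room for `blocks` hashes.
       Each duplicate block would be stored as a mere reference. */
    size_t distinct_blocks(const uint8_t *data, size_t blocks, uint64_t *seen)
    {
        size_t unique = 0;
        for (size_t b = 0; b < blocks; b++) {
            uint64_t h = fnv1a(data + b * BLOCK, BLOCK);
            size_t i;
            for (i = 0; i < unique; i++)
                if (seen[i] == h)
                    break;
            if (i == unique)
                seen[unique++] = h;
        }
        return unique;
    }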

Byte-level de-duplication may use proprietary algorithms or dedicated hardware to perform O(n^2) matches over 1KB of data or more. This is quite energy intensive but it provides the tightest matches. It also provides the most devastating failure modes because the system has minimal redundancy.

After considering byte-level de-duplication techniques, Huffman compression and various other storage techniques (in memory and on media), I believe that Fibonacci lengths should be seriously considered. If we go through the standard Fibonacci sequence (variants exist and may be useful), we have: 1, 1, 2, 3, 5, 8, 13, 21 and 34. My proposal is that files can be stored exclusively in 13 byte fragments and 21 byte fragments. Furthermore, every fragment can be given an eight byte handle where that handle contains or is augmented with a file server cluster number and fragment checksum. With five bytes or more of contiguous fragment numbers, a malicious user who is trying to exhaust all allocations would require 13*2^40 bytes of storage. This requires a 13TB file quota. With a less rigorous checksum or checksums external to the fragment reference, the required storage is significantly larger.
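A small sketch of carving a file length into 21 byte and 13 byte fragments; the greedy, pad-the-tail policy is my assumption rather than anything specified above:

    #include <stdio.h>

    static void plan_fragments(unsigned long len)
    {
        unsigned long big = len / 21;            /* 21 byte fragments */
        unsigned long rest = len % 21;
        unsigned long small = (rest + 12) / 13;  /* pad tail into 13s */
        printf("%lu bytes -> %lu x 21 + %lu x 13 (%lu bytes padding)\n",
               len, big, small, big * 21 + small * 13 - len);
    }

    int main(void)
    {
        plan_fragments(1000);   /* 47 x 21 + 1 x 13, no padding       */
        plan_fragments(1001);   /* 47 x 21 + 2 x 13, 12 bytes padding */
        return 0;
    }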

In typical usage, read and write operations will be large sequential operations with minor fragmentation for data which is more commonly stored. In the worst case, unique data incurs considerable overhead. Indeed, if you only have one or two copies of data then mis-alignment by one byte incurs a duplicate copy with additional fragmentation overhead. However, by random walks and the birthday paradox, savings occur rapidly. Four or five copies are likely to achieve alignment. And 22 copies of data must attain some savings under any alignment because, by pigeonhole, two of the 22 copies must share one of the (at most) 21 distinct alignments. So, if you're in an office where 500 people get CCed then you can be certain that a maximum of 21 copies of the data will be stored on disk.

In a clustered environment, there is the possibility that two nodes discover the same fragment of data independently. In the general case, this is of no consequence. After replication, one reference takes precedence and the other falls into obsolescence. To ensure load balancing, precedence may be determined by checksum prior to node ID. Stripping out stale references is an interesting exercise in a clustered environment. Although, a general solution is likely to encourage wear-leveling.

However, the benefit is the ability to perform full-text indexing. Considering seven bit ASCII first, each 13 byte fragment and 21 byte fragment has an upper bound for the number of words which do not occur adjacent to a fragment boundary. Likewise, zero or one words span consecutive fragments. (Words which are longer than 13 bytes may be awkward.) Latin1, UTF-8 and UCS2 are more awkward, especially if text contains long fragments of Japanese or suchlike. Regardless, it is possible to index such text unambiguously after a file has been closed.

All of this can be dumped into a full-text index and searched in a manner which enforces file and directory privileges. It is possible (but not easy) to perform a database join in a manner which maintains partial ordering of resource IDs. It is also possible to perform this in a manner which allows significant fragment ID compression. This should scale beyond 1EB (2^60 bytes) but I may be optimistic.

If you doubt the wisdom of chopping data so finely then please investigate the sort utility used in a previous project.

My Ideal Processor, Part 3

Posted by cafebabe on Tuesday October 31 2017, @03:29PM (#2731)
0 Comments
Hardware

(This is the 49th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

ARM is a mess. Skip details if too complicated. When I see "ARM Cortex", I translate this as "ARM, with the interrupt controller which is incompatible with every other variant of ARM". Indeed, when we say ARM, is that ARM with the 26 bit address space and dedicated zero register, ARM with the 32 bit address space (with or without Java hardware, with or without one of the Thumb modes, with or without one of the SIMD variants) or ARM with 64 bit registers? And that's before we get to ARM licensee variants. So, for example, Atmel ARM processors are configured very much like micro-controllers. I/O pins may be set in groups of 32 to be digital inputs or digital outputs (with optional integral pull-up or pull-down resistors) or over-ridden to provide other functions. Broadcom's BCM2835, used in some models of Raspberry Pi, has banks of 30 bit registers where I/O pins can be set in groups of 10 and potentially be set to one of eight functions. I presume ARM processors from NXP have something completely different.

One of the most consistent aspects of ARM Cortex is that every manufacturer seems to offer 16 interrupt levels. No more. No less. In this regard, ARM Cortex is very similar to a TMS9900. This was a heavy-weight 3MHz mini-computer core which, entertainingly, was also the core of Speak & Spell, as used by E.T. to phone home. This also has similarities to the RCA1802 which is most famous for being used in satellites.

These designs come from a very different age where open-reel tape (digital and analog) was the norm and bubble memory was just around the corner. Can you imagine a system with 2KB RAM and 4MB of solid state storage? That almost happened. Indeed, retro-future fiction based on this scenario would be fantastic.

The RCA1802 architecture is of particular interest. They hadn't quite got the idioms for subroutines. Instead, there was a four bit field to determine which of the 16 × 16 bit general purpose registers would be the program counter. Calling a subroutine involved loading the subroutine address into a register and setting the field so that the register became the new program counter. In this case, execution continued from the specified address. Return was implemented by setting the field to the previous state. This has some merit because leaf subroutines typically use fewer registers. So, it isn't critical if general purpose registers become progressively filled with return addresses. Indeed, if you want to sort 12 × 16 bit values, an RCA1802 does the job better than a 68000 or an ARM processor in Thumb mode. However, in general, every subroutine has to know the correct bit pattern of the four bit field for successful return. This creates a strict hierarchy among subroutines and, in some cases, requires trampolines.

The TMS9900 architecture does many things in multiples of 16. It has a great instruction set which uses the 16 bit instruction space efficiently. It has 16 interrupt levels. And it keeps context switching fast by having 16 × 16 bit registers in main memory, accessed via a workspace pointer. Indeed, there are aspects of this design which seem to have influenced early iterations of SPARC and ARM.

If we mix features of the RCA1802 and the TMS9900, we have a system which responds to interrupts within one instruction cycle. Specifically, if we take a TMS9900 interrupt priority encoder and use it to select an RCA1802 register as a program counter then nested interrupts occur seamlessly and without delay.
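A toy software model of that combination (my construction; widths and names are arbitrary): a priority encoder selects which of the sixteen program counters is live, so entering or nesting an interrupt is a change of index rather than a save/restore sequence.

    #include <stdint.h>

    #define LEVELS 16

    struct cpu {
        uint32_t pc[LEVELS];   /* one program counter per priority level */
        unsigned level;        /* current priority; 0 is highest         */
        uint16_t pending;      /* bit mask of pending interrupt requests */
    };

    /* Priority encoder: the highest pending level which outranks the
       current level, otherwise the current level. */
    static unsigned select_level(const struct cpu *c)
    {
        for (unsigned i = 0; i < c->level; i++)
            if (c->pending & (1u << i))
                return i;
        return c->level;
    }

    /* Called once per instruction: switching the active program counter
       is just an index change, so interrupt entry costs nothing extra. */
    static uint32_t *active_pc(struct cpu *c)
    {
        c->level = select_level(c);
        return &c->pc[c->level];
    }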

To implement this on a contemporary processor, we would have 16 program counters. Where virtual memory is implemented, we probably want multiple memory maps and possibly a stack pointer within each memory map. However, if all program counters initialize to known addresses after reset, we do not require any read or write access to shadow program counters. Even if a processor is virtualizing itself, by the Popek and Goldberg virtualization requirements, it is possible to devise a scheme which never requires access to a shadow program counter.

In the worst case, we incur the initial overhead of pushing general registers onto a stack. (A sequence which may be nested as interrupt priority increases.) We also incur the overhead of restoring general registers. However, rather than loading alternate program counters with an address from a dedicated register or always indirecting from a known memory location, we unconditionally jump to one instruction before the interrupt handler and execute a (privileged) yield priority instruction. On a physical processor, this leaves the alternate program counter with the most useful address. On a virtual processor, privilege violation leads to inspection of the instruction and that indicates that the address should be stored elsewhere.

On the first interrupt of a given priority, we require a trampoline to the interrupt handler. But on all subsequent calls, we run the interrupt handler immediately. It typically starts by pushing registers onto a local or global stack. Beyond that, state which is specific to interrupts may be nothing more than a shadow program counter for each interrupt priority level and a global interrupt priority field. No instructions are required to read the internal state of the interrupt system. One special instruction is required to exit an interrupt and this is typical of many systems.

My Ideal Processor, Part 2

Posted by cafebabe on Tuesday October 31 2017, @03:11PM (#2730)
0 Comments
Hardware

(This is the 48th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

Did the Chinese government invent an infinitely scalable processor switch fabric or are the figures from Chinese super-computers a complete fabrication? Indeed, if the Chinese super-computer switch fabric is so good, why does China continue to use Cisco switches for the Great Firewall Of China and why does China intend to use Cisco switches to connect all of its population?

The conventional super-computer design is to have 2^(3n) nodes in a three dimensional grid. For example, 4096 nodes in a 16×16×16 grid. Each node is connected to three uni-directional data loops. Ignoring network congestion, the round-trip time between any two nodes in a loop is constant. The round-trip time between any two nodes in a plane is constant. And the round-trip time between arbitrary nodes is constant. In this arrangement, nodes have direct access to each other's memory and it is fairly easy to implement a memory interface which provides equal access to three network links and a local node.

The rate of throughput is obscene. An Intel Xeon processor may have up to 32 cores and each core has two-way hyper-threading. Each thread may consume up to five bytes of instruction per clock cycle and the clock is 4GHz. That's a peak instruction stream execution rate of 1280GB/s per node. For a 4096 node cluster, that's about 5PB/s. Memory addressing is also special. 12 or more bits of an address are required merely to specify the node number. With 4GB RAM per node, take 32 bit addressing and add another 12 bits for physically addressable RAM.

This arrangement is highly symmetric, highly fragile and rapidly runs into scalability problems. And yet, China just adds a cabinet or two of processor nodes and always retains the world super-computer record. Or adds GPUs to a subset of nodes. (Where does the extra memory interface come from?) Is China using a partially connected, two local, six remote hypercube topology? Is there any known upper bound for this switch fabric which is at least five years old?

Assuming China's claims are true, it is possible to make a heterogeneous super-computer cluster with more than 16000 nodes and at least 1Gb/s of bandwidth per node, although such a machine has no significant MTBF. Even the cross-sectional area of the first level caches of 16000 MIPS processors creates a significant target for random bit errors. Likewise for ALUs, FPUs and GPUs.

I investigated redundancy for storage and processing. The results are disappointing because best practice is known but rarely followed. For storage, six parity nodes in a cluster is the preferred minimum. Four is rare and zero or one is the typical arrangement. For processing, anything beyond a back-check creates more problems than solutions. Best-of-three is great but it triples energy consumption and the rate of processor failure. With storage, it is mirrored at the most abstract level and that may be on different continents. At the very worst, it will be on separate magnetic platters accessed via separate micro-controllers. With processing, redundancy is on the same chip on the same board with the same power supply.

So, for processing, best practice is a twin processor back-check, like the mainframes from the 1970s. For a super-computer cluster, every node should participate in a global check-point and every computation should be mirrored on another node. Errors in parallel computation propagate extremely fast and therefore if any node finds an error, all nodes must wind back to a previous check-point. Checks can be performed entirely in software but it is also possible for a processor with two-way hyper-threading, with two sets of registers and also two ALUs and two FPUs, to run critical tasks in lock-step and throw an exception when results don't match.
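An all-software version of the twin back-check might look like this sketch (names are mine); on a real cluster, the second evaluation would run on a mirror node rather than in the same thread, and a mismatch would trigger the global wind-back.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t (*task_fn)(uint64_t seed);

    /* Run a pure computation twice and compare. Returns false on
       mismatch, in which case the caller rewinds to the previous
       check-point and retries. */
    bool backchecked(task_fn f, uint64_t seed, uint64_t *out)
    {
        uint64_t a = f(seed);   /* primary computation  */
        uint64_t b = f(seed);   /* mirrored computation */
        if (a != b)
            return false;
        *out = a;
        return true;
    }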

Now that I've considered it in detail, it is apparent to me that Intel has never been keen on anything beyond two-way hyper-threading. I just assumed Intel was shipping dark threads to facilitate snooping. ("Hey! You've actually got eight-way hyper-threading but we use six to compress video of you fapping when you think your webcam is off.") But perhaps selected customers get a reliable mode and us plebs get the partial rejects without scalability.

My Ideal Processor, Part 1

Posted by cafebabe on Tuesday October 31 2017, @02:55PM (#2729)
3 Comments
Hardware

(This is the 47th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

There have been numerous successful processor architectures over many decades but, after MIPS and Alpha fell by the wayside, I assumed we had reached the point where it was x86 versus ARM. However, recent events led me to re-evaluate this situation. I've found two additional contenders. The sale of (British) ARM to (Japanese) Softbank may have over-shadowed the subsequent sale of (British) Imagination Technologies to a Chinese sovereign fund. Imagination Technologies has the rights to legacy MIPS architectures but it is most noteworthy for losing a contract because Apple intends to develop its own GPUs rather than license successors to the PowerVR architecture. The drop in value made Imagination Technologies a very worthwhile acquisition for the Chinese government, especially when Chinese (and Russian) super-computers have often been based around MIPS.

However, a third, overlooked Californian architecture could eclipse everything. We've often seen how awful, low-end architectures grow in capability, consume mid-range systems and, more recently, are clustered until they fulfill top-tier requirements. Well, that could be the Xtensa architecture. It is a 16 register, strict two-read, one-write, three-address machine with licensing for 512 bit implementation and numerous optional instructions. Unlike MIPS, ARM or, in particular, x86, Xtensa instruction decode is ridiculously simple while maintaining good code density. Specifically, the opcode determines if the instruction is two bytes or *three* bytes with no exceptions. Potentially, this can be determined from one (or maybe two) bits of the opcode. Presumably, this requires break-point instructions of two lengths. However, three byte instructions invariably exceed the code density of ARM and MIPS while having significantly simpler instruction decode.

At present, the Xtensa architecture is typically surrounded by an accretion of shit but the core CPU is sound. Xtensa processors are most commonly packaged by Espressif Systems in the ESP8266 which is then packaged in various tacky micro-controller boards with, for example, wireless networking of various protocols. Firmware is typically held in an external serial ROM and is, presumably, paged using base firmware which maintains an LRU cache. Handling of interrupts via paging is extremely slow and this has led to the ESP8266 processor being unreliable and a certain method to achieve network time-out. Regardless, this is packaged into IoT devices where making a network connection is a small miracle and security is severely neglected. But don't worry because there are plenty of consultancies who upsell services to cover these deficiencies!

If you dump everything from the serial ROM and outwards, there's an architecture which can out-compete the world's fastest super-computer. I can only approach this efficiency by describing multiple iterations of vaporware. I begin with a few general observations. Firstly, every instruction set is more than 2/3 full from the first iteration. An exception to this rule is the MOS Technology 6502 instruction set, which was designed specifically so that the processor could be an almost pin-compatible, smaller die, higher yield clone of Motorola's 6800 processor. (I would have been *extremely* unamused if I worked at Motorola during this period.) Secondly, it has been known for decades that approximately 1/2 of processor instructions are register moves and approximately 1/3 of processor instructions are conditional operations. Thirdly, it would be logical to assume that code density can be maximized by allocating instruction space in proportion to instruction usage.

An example of these observations is the Z80 instruction set. Ignoring prefix opcodes and alternate sets of registers, there are eight registers. Eight bit opcodes with the top two bits set to 01 encode a move instruction. The remaining six bits of the opcode indicate a three bit destination field and a three bit source field. So, 01-011-100 specifies a move between registers and 1/4 of the instruction space is allocated to these operations. That's fairly optimal for various definitions of optimal. However, one of these eight registers is a flag register and this is rarely a source or destination. Also, a move to self only sets flags. So, only 42 of the 64 encodings are typically used. And the use of alternates and escapes greatly increases opcode size for the subset of total permitted moves. The Motorola 68000 architecture doubles the size of each opcode field and allows memory addressing modes as a source and/or destination but otherwise works in a similar manner. So, 0000-000:011-000:100 performs a similar task between 68000 data registers but with lower code density. From this, I wondered if it was possible to define an instruction set where the majority of the instruction set was prefix operators - even to the extent that all ALU operations would require a prefix. My interest in this architecture was renewed when the x86 VEX prefix was devised. In addition to providing extensible SIMD, it also upgraded x86 from a two-address machine to a three-address machine.

So, it is certainly possible, and even economically viable, to make a micro-processor where the default instruction is a two-address move, one prefix allows a larger range of registers to be addressed, another prefix specifies an ALU operation and another prefix upscales all two-address instructions to three-address instructions. (Have we consumed the instruction space? Probably, and then some.) However, there is something really inefficient about this arrangement. Prefixes can be specified in any order. The trivial implementation consumes one clock cycle per prefix while accumulating internal state. However, there is a more fundamental problem related to permutations. Two prefixes have two permutations and therefore occupy two places in the instruction space. Three prefixes have six permutations and therefore occupy six places in the instruction space. Four prefixes have 24 permutations and therefore occupy 24 places in the instruction space. For N prefixes, we lose approximately log2(N!) bits of instruction space through duplication. That's grossly inefficient.

I moved away from this idea and worked on quad-tree video codecs. From this, I found that it was possible to implement instructions where two or more fields specify operands and none of the fields specify an opcode. An example would be a 16 bit instruction divided into four equal size fields. If the top bit of a field is set then the field represents opcode (instruction). If the top bit of a field is clear then the field represents operand (register reference). From this, we have 16 sets of instructions where a large number of CISC instructions apply to zero, one or two registers and we have a lesser number of instructions which apply to three registers. We also have one implicit instruction which applies to four registers. We can tweak the ratio of registers and instructions. 12, 13 or 14 general purpose registers may be preferable. I continued working on this idea and found that the fields can be divided strictly into inputs and outputs. It is possible to make a comprehensive design which permits 16 bit registers, 32 bit registers or 64 bit registers and has no prefix instructions or internal state between instructions. Donald Knuth's designs and the Itanium architecture show that it is possible to discard a dedicated flag register. I also found that it was trivial to consider a super-scalar design because the outputs of one instruction are trivial to compare against the inputs of the next instruction. However, code density is terrible, especially when handling immediate values. At best, I could achieve six bits of constant per 16 bit instruction - and even this stretches the definition of no intermediate state.
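To make the format concrete, a hedged decoder sketch: four 4 bit fields per 16 bit instruction, with the top bit of each field distinguishing opcode from register operand. The strict top-bit rule yields eight registers; the 12 to 14 register variants mentioned above would tweak this split.

    #include <stdint.h>
    #include <stdio.h>

    /* Classify each field of a 16 bit instruction, most significant
       field first. Field values 8..15 are opcode; 0..7 are registers. */
    void classify(uint16_t insn)
    {
        for (int i = 3; i >= 0; i--) {
            unsigned field = (insn >> (i * 4)) & 0xF;
            if (field & 0x8)
                printf("op%u ", field & 0x7);
            else
                printf("r%u ", field);
        }
        printf("\n");
    }

An instruction with no opcode field set is the implicit four-register instruction described above.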

However, further work with codecs and networking found that it is possible to represent instructions of unlimited size using BER or similar. Effectively, the top bit of an eight bit field specifies if a further byte follows. This can be daisy-chained over a potentially unlimited number of bytes. From here, we view the instruction space as a binary fraction of arbitrary precision. (This works like many of the crypto-currencies where, potentially, coin ownership can be divided into arbitrarily small binary fractions.) A proportion of the instruction space may be privileged in a manner which allows a break-point instruction of arbitrary length. We may also have a fence operation which works in a similar manner. We may have addressing modes of arbitrary complexity. However, multiple reads and/or writes per opcode is very likely to be detrimental when used in conjunction with virtual memory. We also retain much of the prefix and super-scalar stuff.
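The chaining is essentially a variable-length integer. A sketch (the little-endian, LEB128 style byte order is my assumption; BER proper is big-endian):

    #include <stddef.h>
    #include <stdint.h>

    /* Encode v as seven bits per byte; the top bit marks continuation. */
    size_t varint_encode(uint64_t v, uint8_t *out)
    {
        size_t n = 0;
        do {
            uint8_t b = v & 0x7F;
            v >>= 7;
            out[n++] = b | (v ? 0x80 : 0);
        } while (v);
        return n;
    }

    /* Decode; returns the number of bytes consumed. */
    size_t varint_decode(const uint8_t *in, uint64_t *v)
    {
        size_t n = 0;
        int shift = 0;
        *v = 0;
        do {
            *v |= (uint64_t)(in[n] & 0x7F) << shift;
            shift += 7;
        } while (in[n++] & 0x80);
        return n;
    }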

We only get seven bits of instruction per byte and this significantly reduces code density. I considered disallowing one byte instructions and this may be necessary to compete with the Xtensa architecture. In either case, 1/4 of the instruction space provides move operations between general purpose registers. Other move operations may raise the proportion of move instructions closer to the optimal 1/3. For a super-scalar implementation, one unit handles trap instructions, fence instructions and branches. This ensures that all of these operations are handled sequentially. Secondary units are able to handle all ALU and FPU operations. Tertiary units have limited functionality but it is conceivable that eight FPU multiplies may be in progress without having a vector pipeline or out-of-order execution and reconciliation.

If I had unlimited resources, I would investigate this thoroughly. If I immediately required an arbitrary design then I would strongly consider the Xtensa architecture. If I immediately required a chip on a board and didn't care about memory integrity then I'd consider an Arduino or Raspberry Pi. If a faster processor or ECC RAM is required then x86 remains the default choice. However, I'd be a fool to ignore ARM, MIPS, Xtensa or future possibilities.

My Ideal Bag

Posted by cafebabe on Tuesday October 31 2017, @02:30PM (#2728)
4 Comments
Hardware

(This is the 46th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

Before PDAs [Personal Digital Assistants] and smartphones, people had paper diaries. Companies like Filofax made a fortune by selling modular stationery. Numerous companies made compatible stationery and it was possible to make your own stationery which fit in Filofax's miniature ring-binders. However, many people preferred to buy the official branded products. Yes, before people boasted that they could animate a poo emoji with facial tracking, they would boast about carrying official Filofax graph paper. (Given the ornate designs of Bronze Age daggers, I suspect this type of boasting goes back many millennia.)

I believe that there is a market for a modular bag format in which any party is free to produce interoperable clones. Bag segments are self-contained and nominally cylindrical. All use one yard, YKK, metal tooth zippers. A bag segment may or may not have handles. One or more bag segments may or may not be used with end caps. An end cap may have functionality such as being a detachable toiletry bag.

Bag segments may include a duffel bag, rucksack straps, a cool bag, a vinyl LP bag or a handbag. A handbag is one or more stubby cylindrical segments with end caps providing carry handles and/or a shoulder strap. However, to be interoperable with the other segments, a modular handbag always has circular end caps with a circumference of one yard.

If you wish to make modular bags then please have one small hole in each of the internal circular panels. This allows USB and/or the cell networking protocol to be threaded through any number of bag segments. Also, please place the hole near the seam of the zipper to minimize problems due to inadvertent ingress of water.

Common cabling, signalling and power allows a modular bag to connect to my ideal car and that can be connected to my ideal house using a protocol which does not tunnel outside of the PAN or LAN.