My Ideal Processor, Part 6

Posted by cafebabe on Tuesday December 19 2017, @09:21PM (#2872)
0 Comments
Hardware

(This is the 56th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

There is a computer problem which was known to Charles Babbage when he was designing his unsuccessful Analytical Engine. Most impressively, he had a solution.

When adding large numbers (in decimal, binary or any other base), there are a large number of carry operations from one column to the next. In the worst case, every column of interest is interlinked with its neighbors. This leads to the phenomenon of ripple carry, where propagating a carry requires a large number of iterations. For example, a decimal odometer may occasionally tick over from xxxxx999 to xxxxy000. For a mechanical system, this occurs synchronously where there is sufficient force to move every digit. However, even this has limitations.

Carry is no more or less likely to occur in a binary machine. However, a binary machine requires the most digits to represent a given number and therefore the worst case has the most impact. In any base, for each column of addition, there are two numerical inputs, one numerical output, one carry in and one carry out. In any base, carry is zero or one. For the least significant digit, carry in is always zero. However, this only skews the average case in our favor and does nothing to prevent the worst case.

For any base b, there are b^2 permutations of digits to add. Within a given range of precision, digits may occur with fairly equal probability. (This is particularly true for floating point operations.) Taking into account carry input, there are 2(b^2) input permutations for each column of addition. For a digit of decimal addition, there are 200 permutations of input. For base b, where carry in is zero, there are b(b-1)/2 input permutations with carry out of one. In decimal, this is 45 permutations with carry out of one. For base b, where carry in is one, there are b(b+1)/2 input permutations with carry out of one. In decimal, this is 55 permutations with carry out of one. In total, exactly 100 of the 200 input permutations produce a carry.
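
This arithmetic is easy to verify by brute force. A minimal C sketch which enumerates all 200 input permutations of one decimal column:-

#include <stdio.h>

int main(void)
{
    int carries = 0, total = 0;

    for (int carry_in = 0; carry_in <= 1; carry_in++)
        for (int a = 0; a <= 9; a++)
            for (int b = 0; b <= 9; b++) {
                total++;
                if (a + b + carry_in >= 10)
                    carries++; /* this permutation produces carry out of one */
            }
    printf("%d of %d permutations carry\n", carries, total); /* 100 of 200 */
    return 0;
}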

This was known to Charles Babbage and he devised a mechanical system to accelerate this process. Although Charles Babbage had difficulty obtaining funding and manufacturing an Analytical Engine, this system has been partially implemented in metal and a Difference Engine was made in Lego Technic.

When the scale of integration was low, mini-computers typically used 74 Series logic chips. One of these chips is a 4 bit full-adder. It accepts 4 bits from one source and 4 bits from another source, has carry in and carry out, and provides 4 bits of data out. Of course, exactly half of the input permutations produce a carry out of one.

Historically, propagation from carry in to carry out would take more than 70 nanoseconds, and less than 10ns of this propagation delay was due to the silicon chip's anti-static protection circuitry when a signal went off-chip. Anyhow, adding 16 bit numbers could take more than 280ns. Addition of 64 bit numbers was infeasibly slow. It also required circuitry which was infeasibly large and expensive. It was preferable to work on smaller units and make addition of, for example, 16 bit units a composable operation via a carry flag in a flag register. On this basis, it was possible to add (or increment) numbers of arbitrary size in a manner which was relatively fast for small numbers, only incurred overhead in the most rare cases, and used available hardware efficiently.
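
In C, this composition can be sketched with 16 bit limbs; the carry variable and the add_wide() name below are my own, standing in for a carry flag and an add-with-carry instruction:-

#include <stdint.h>
#include <stddef.h>

/* Add little-endian multi-limb numbers of arbitrary size.
   Each limb is 16 bits; carry ripples between limbs via the flag. */
void add_wide(uint16_t *dst, const uint16_t *a, const uint16_t *b, size_t limbs)
{
    uint32_t carry = 0;

    for (size_t i = 0; i < limbs; i++) {
        uint32_t sum = (uint32_t)a[i] + b[i] + carry;
        dst[i] = (uint16_t)sum;
        carry = sum >> 16; /* zero or one, exactly as in the text */
    }
}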

My introduction to this topic came via an unlikely path. The Chinese Remainder Theorem can be used as a method to avoid long chains of ripple carry. CRT is very good for counting, addition and subtraction. CRT is exceptionally bad for comparison operations, and the Chinese used it to calculate fractions despite this limitation. In the classic arrangement, each number (or fraction of a number) is represented by the remainders of division by five, eight and nine. These three co-prime moduli have a product of 360. (This technique was probably obtained from the Babylonians and even then it may not have been original.) Addition and subtraction can be performed on each digit of the triples. After normalization, a lookup table was used to retrieve common answers. So, for example, 60 + 90 = 150. As triples, (0,4,6) + (0,2,0) = (0,6,6). The classic arrangement is partly a notational problem. Regardless, nine bit computation is reduced to parallel computation requiring no more than four bits. In principle, this can be vastly extended.
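
A minimal C sketch of this arrangement, assuming the classic (5,8,9) moduli; the triple type and function names are my own:-

#include <stdio.h>

typedef struct { int m5, m8, m9; } triple; /* residues mod 5, 8 and 9 */

static triple to_triple(int n)
{
    return (triple){ n % 5, n % 8, n % 9 };
}

static triple add_triple(triple a, triple b)
{
    /* each residue is added independently; no carry crosses between them */
    return (triple){ (a.m5 + b.m5) % 5, (a.m8 + b.m8) % 8, (a.m9 + b.m9) % 9 };
}

int main(void)
{
    triple t = add_triple(to_triple(60), to_triple(90));
    printf("(%d,%d,%d)\n", t.m5, t.m8, t.m9); /* (0,6,6), which is 150 mod 360 */
    return 0;
}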

Apparently, CRT can be used to measure sonar ping times much more accurately without consideration for long chains of ripple carry. Unfortunately, it does this by pushing a problem elsewhere. Fortunately, there are other techniques to amortize the problem of ripple carry. One technique is to use a vector pipeline. If a vector pipeline has four execution units with staggered timing (like a four cylinder combustion engine) then ripple carry may propagate within units while more data is fed to subsequent units. Unfortunately, in the case of one addition, the majority of the system is unused. Furthermore, no speed advantage occurs.

74 Series provides a carry accelerator chip. The chip itself is composable. Therefore, addition of any size may be arranged as a quadtree where 4 bit full-adders are the leaves and carry accelerators are the branches. So, one tier of branches (one chip) accelerates addition up to and including 16 bits. Two tiers of branches accelerate addition up to and including 64 bits. Three or more tiers are used for larger numbers.
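
The accelerator works from group generate and propagate signals rather than waiting for ripple. A sketch of the logic of one tier in C (the equations are the standard carry-lookahead forms, not a pinout of any particular chip):-

/* One lookahead tier over four 4 bit adder groups. Given each group's
   generate (g) and propagate (p) signals, every group's carry in is
   computed directly from c0 instead of rippling through the groups. */
void lookahead4(const int g[4], const int p[4], int c0, int c[4])
{
    c[0] = c0;
    c[1] = g[0] | (p[0] & c0);
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0);
}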

Historically, many CPUs had a distinction between data registers and address registers. Likewise, old IBM systems allowed integer and floating point calculations to occur in parallel so that Fortran programs could calculate array indexes and scientific data concurrently. However, with a trend towards general purpose registers, typically with a maximum of a 4 bit operand reference, address computation goes through a general ALU datapath. This may be 128 bits or more - even when an address-space is significantly smaller.

How often does this occur? On a system with eight registers (ARM Thumb on a BCM2835 Raspberry Pi with Raspbian and gcc), I spent more than three days getting a bit transpose operation working efficiently and then further effort to get it working portably on AVR and other architectures. The most significant part of the solution was to reduce register pressure by performing address arithmetic. This allowed all input to be read via one register and all output to be written via another register. However, on a CPU with wide registers, incrementing an address pointer may be a slow operation with little opportunity to amortize any delay.
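
As an illustration of the technique (not the original code), a minimal 8 by 8 bit transpose in C where all input is read via one base register and all output leaves via one post-incremented pointer:-

#include <stdint.h>

void transpose8(const uint8_t *in, uint8_t *out)
{
    for (int bit = 0; bit < 8; bit++) {
        uint8_t row = 0;

        for (int byte = 0; byte < 8; byte++)
            row |= ((in[byte] >> bit) & 1) << byte; /* all reads via one base register */
        *out++ = row; /* all writes via one post-incremented register */
    }
}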

At a higher scale of integration, addition does not have to be performed in 4 bit units. However, stretching carry acceleration from a quadtree to a quintree or similar provides minimal gain. Most annoyingly, 5 bit full-adders and two tiers of 5 bit acceleration only provide 125 bit addition. Likewise, 7 bit full-adders and two tiers of 6 bit acceleration only provide 252 bit addition. It is inescapable that addition for larger numbers is slower. Variants can be devised to match or exceed 128 bit, 256 bit, 512 bit or 1024 bit thresholds but I'm not sure that any provide superior performance to a quadtree. Specifically, if it takes four units of time to add 4 bit numbers then it takes eight units of time to add 16 bit numbers, 12 units of time to add 64 bit numbers, 16 units of time to add 256 bit numbers and 20 units of time to add 1024 bit numbers. Whereas, the 125 bit example requires 15 units of time and the 252 bit example requires 19 units of time. The only advantage I could find was to use two tiers of 3 bit acceleration but this only fills one position between each pair of 4^n bit numbers. For example, 10 units of time to add 36 bit numbers and 14 units of time to add 144 bit numbers.
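
In these terms, each tier contributes time equal to its width and multiplies the bit count by its width. So, for a uniform radix r, a tree of r bit full-adders with r bit acceleration sums n bits in r * log_r(n) units of time: the quadtree gives 4 * log_4(1024) = 20 units for 1024 bits while the quintree gives 5 * log_5(125) = 15 units for only 125 bits.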

Timings for large additions may be hidden among instruction decode and a round-trip between registers and ALU. However, a larger problem remains unresolved. Pointer sizes are becoming ridiculously large. I could semi-sensibly use a 144 bit address-space. For example, mmap() of a 4KB block, 128 bit hash reference, Fibonacci deduplicated, sparse file. Then add more bits for virtualization. Why would I do that? Well, I could just "mount" a data-bank within a process. This could be read-only or read/write. And it would be possible to maintain accounting and billing external to the process. Regardless, long file pointers and long address pointers put significant pressure on data caches. 8 byte pointers are common and this may increase to 16 bytes or much more. With address randomization, this provides a modicum of security. It is otherwise a vast hindrance. For a typical application, such as a database server, it is estimated that the transition from 4 byte pointers to 8 byte pointers has the effect of reducing a data cache by approximately 15%. Indeed, it has been suggested that a multi-core server with 32GB RAM should run two 32 bit instances of MySQL Server rather than one 64 bit instance. (The majority of the RAM is best managed by an Operating System's LRU storage cache.)

As a matter of practicality, processors may be pushed towards split banks of data registers and address registers. This isn't perfect but it may be preferable to a reduction in registers as their size increases. An asymmetrical design allows data registers to be significantly wider than address registers and computations may be performed independently on each bank. Numerous designs have elements of this, including IBM's old systems, Data General Nova, Z80, MC68000 and, generously, Intel's Pentium.

Among the designs which have a distinct split between data registers and address registers, data registers obtain the most use. Under-utilized address registers (of the same width) get used as temporary storage. It would be possible to continue this practice if register width differs. However, truncation will occur.

In the most optimistic case, it is possible to conflate operands and addressing modes to sneak an extra bit of operand reference. For example, addressing modes such as data register, address register, address register indirect or address register indirect with post increment. Firstly, this example only requires two bits to implement a variety of useful addressing modes. Secondly, an increment operation can be performed within a register while data may be processed elsewhere. Incrementing a register by one, two, four and/or more is trivial if ripple carry is not a concern. However, rapidly incrementing or decrementing a register is a much larger problem.

It may be useful to provide two separate stacks for data and addresses. This would be in addition to a return address stack. For ease of implementation and maximum performance, each type of stack could be aligned to the width of each type of register. It may be possible for references into an address stack to be read-only or entirely absent.

In summary, the search for fast arithmetic has been ongoing for millennia. Unfortunately, the problem is getting worse rather than better. This is due to the potential volume of data and a secondary effect where the volume of data requires larger references.

My Ideal Processor, Part 5

Posted by cafebabe on Tuesday December 19 2017, @09:20PM (#2871)
0 Comments
Hardware

(This is the 55th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

There is a famous result that 1/2 of instructions executed are move instructions and 1/6 of instructions executed are branches. A subset of the remaining 1/3 are ALU, FPU or similar instructions.

Maximize ALU And FPU Throughput

Ideally, a processor should allow full utilization of an ALU and/or FPU. However, an instruction stream only provides suitable instructions in 1/3 of cases. Attempts to increase instruction throughput may lead to additional ALUs which are also under-utilized. An example is the Intel Pentium which provides a supplementary integer unit. Processor architectures such as the MIPS R10000 permit zero, one or two FPUs on the basis that FPU operations take two or more clock cycles whereas ALU operations generally take one clock cycle. One processor architecture allowed up to 16 co-processors and any of these could be FPUs. However, the physical arrangement of co-processors incurred a one clock cycle penalty per co-processor as instructions were handed along a daisy-chain. In the case of multiple FPUs, each successive co-processor provided less throughput.

Increased throughput may lead to similar paths of development on multiple processor architectures. Development of SIMD and vector pipelines on x86 followed an almost identical path on ARM. In the first iteration, a processor has no FPU support. In the second iteration, FPU functionality becomes increasingly integrated. In the third iteration, SIMD uses existing FPU registers. (Intel MMX, AMD 3DNow! and ARM NEON.) This allows SIMD functionality to work with legacy operating systems. In the fourth iteration, SIMD obtains its own registers. In the fifth iteration, SIMD registers are expanded beyond the size of FPU registers. In the sixth iteration, a vector pipeline avoids the increased state of SIMD.

Attempts to execute multiple instructions may be taken to extremes. Out-of-order execution on x86 uses a 3 bit register reference as a tag within a RAT [Register Alias Table] of 144 shadow registers or more. Meanwhile, data is transported to one of the numerous, specialized execution stations. Execution stations are provided in quantities which will maintain good throughput for typical workloads. Given that x86 effectively has more than 1000 instructions, intricate structures have developed to execute them effectively. However, this leads to curious holes in architectural models. For example, an Intel Pentium 3 incurred a large delay when switching between FPU instructions and SIMD instructions because the contents of the registers were transported to different parts of the processor.

As register width has increased to 32 bits or more, it is sufficient to represent IEEE754 single precision values (or larger) in one general purpose register. Indeed, given the monotonic representation of IEEE754, integer and floating point operations may share hardware. Only positive infinity, negative infinity, not a number and, possibly, denormalized values require special handling.

(The monotonic bit-level representation of positive IEEE754 values allowed an integer one bit right shift (integer divide by two) to be used as an approximation for a function used in a Newton-Raphson method for square root calculations; most famously the reciprocal square root in Quake and in ray-tracing distance calculations. For square root approximation, the right shift of the positive sign bit into the top bit of the exponent is inconsequential and the right shift of the bottom bit of the exponent into the mantissa remains monotonic. This leads to a Newton-Raphson approximation which always converges. However, it requires the use of a very opaque constant. This technique fell out of general use when direct manipulation of FPU state incurred more clock cycles than using the hardware floating point square root function. However, this wasn't soon enough.)
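
For reference, the widely circulated Quake III routine, which computes the reciprocal square root using the opaque constant 0x5f3759df; this C sketch uses memcpy() for the bit reinterpretation:-

#include <stdint.h>
#include <string.h>

float q_rsqrt(float x)
{
    uint32_t i;
    float y;

    memcpy(&i, &x, sizeof i);          /* reinterpret the float's bits */
    i = 0x5f3759df - (i >> 1);         /* shifted exponent approximates the result */
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y); /* one Newton-Raphson iteration */
    return y;
}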

The distinction between ALU operations and FPU operations is decreasing. It is also possible to eliminate a distinction between ALU and SIMD. Like MC68000 and ARM, a FirePath instruction typically specifies data size. However, rather than only process one unit of data, multiple units of data may be processed in parallel. This may seem like a waste of energy but many ALUs perform computation on a full datapath even if a register only latches the bottom bits of the computation. So, instead of conditionally splitting a register into pieces, it is more efficient (and more symmetric) to conditionally split an ALU into pieces. For addition operations, carry operations may be conditionally inhibited on byte or word boundaries. This incurs very little penalty and may increase computation speed where registers are wider than 64 bits.
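
The same effect can be emulated in software (sometimes called SWAR); a minimal C sketch of four 8 bit additions within one 32 bit register, with carries inhibited at byte boundaries:-

#include <stdint.h>

/* Add four packed bytes in parallel. The top bit of each byte is masked
   off so no carry crosses a byte boundary, then restored via exclusive or. */
uint32_t add4x8(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7f7f7f7fU) + (b & 0x7f7f7f7fU);
    return low ^ ((a ^ b) & 0x80808080U);
}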

Share ALU Among Hyper-Threads

I previously considered an in-order, super-scalar processor architecture where one execution unit implements a full range of instructions including branches, traps and fences in addition to all ALU and FPU operations. Successive units implement a diminishing subset of instructions. This may extend beyond eight units. Unfortunately, this arrangement is grossly inefficient for multiple reasons. In the general case, the first ALU will obtain almost 1/3 utilization and subsequent units will be used sporadically. Total ALU utilization will be about 1/3 of one unit. Ignoring this, parallel execution of instructions is limited by source and destination references. For operations on registers only, each instruction may remove one or more registers from the pool of uncontested references. The pool is quickly exhausted by typical reference patterns. There is a small upper bound when register references are typically three or four bits. Whereas, instruction density is greatly compromised and context switching is greatly increased if register references are larger. Other addressing modes create further problems. For example, multiple reads via virtual memory. In this case, one or more memory exceptions may occur along a sequence of instructions executed in parallel. Without the benefit of out-of-order execution, FPU timings lead to further diminishing returns.

Share Exotic Instruction Implementation Among Hyper-Threads

While considering super-scalar processor architectures, hyper-threading, ALU throughput and attempts at energy conservation through the use of lightweight cores, it occurred to me that something akin to AMD's Bulldozer processor architecture can be generalized. AMD learned from one of Intel's mistakes. Intel's Pentium 4 was unpopular because it failed to improve execution speed of legacy code. It was a good design which worked well with the output of contemporary compilers. However, some legacy code ran slower on an Intel Pentium 4. AMD was careful not to repeat this with the Bulldozer processor architecture and devised an efficient scheme for SIMD. (For legal reasons, I'm going to be careful with the terms processor, core and hyper-thread.) For every pair of processor hyper-threads, SIMD execution units would fuse together for the execution of 256 bit SIMD. This did not adversely affect legacy code if it did not have SIMD instructions. This did not adversely affect legacy code with 128 bit SIMD instructions. Crucially, it did not significantly affect legacy code with 256 bit SIMD. This arrangement disproportionately reduced the size of each processor hyper-thread (and residual energy consumption) while allowing more hyper-threads per chip or more chips per wafer. Unfortunately, AMD was successfully sued by consumers who thought they were being defrauded. (Given that AMD had made extensive effort to ensure smooth operation for a transitional design, I doubt that many sued in good faith. Furthermore, I would have preferred that AMD win this case and that the consumers suing hard disk manufacturers over deceptive storage capacity be the ones who succeeded. However, that would affect more parties over a longer period of time.)

My intention is to fully utilize an ALU. This is above all other considerations. Indeed, the very important consideration of code density is for the purpose of maximizing ALU throughput. If there is a bottleneck from main memory to instruction decode then ALU capacity is wasted. However, when instructions are brought to decode, approximately 2/3 of them do not instruct an ALU to do anything. The trivial solution to this problem would be to share one ALU among three instruction streams. For workloads of three or more threads, this arrangement approaches 100% ALU utilization. If ALU saturation is the objective (and practical implementation), it is possible to skew the ratio and have three ALUs and 10 sets of registers. My experience is that common instructions may be implemented locally per register set or per pair of register sets. Exotic instructions can be shared more broadly. As a practical consideration, it is desirable to keep wires short. This reduces latency. It is also desirable to minimize the number of transistors in critical paths; ideally to 14 or fewer. This reduces switching times and allows a processor to operate at faster speeds.
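
The utilization claim can be checked with a toy model; a hedged C sketch where each hyper-thread issues an ALU instruction with probability 1/3 and stalls while the shared ALU is busy:-

#include <stdio.h>
#include <stdlib.h>

#define CYCLES 1000000L

int main(void)
{
    for (int threads = 1; threads <= 8; threads++) {
        long busy = 0;
        int waiting = 0;

        srand(1);
        for (long c = 0; c < CYCLES; c++) {
            int active = threads - waiting; /* threads not stalled on the ALU */

            for (int t = 0; t < active; t++)
                if (rand() % 3 == 0)
                    waiting++; /* one in three instructions needs the ALU */
            if (waiting > 0) {
                waiting--; /* the ALU retires one operation per cycle */
                busy++;
            }
        }
        printf("%d threads: ALU %.0f%% utilized\n", threads, 100.0 * busy / CYCLES);
    }
    return 0;
}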

If each register set has a local addition unit and a local left shift unit, this is sufficient to implement arbitrary addressing modes. This allows most data structures to be traversed without invoking a global lock on a shared ALU. Multiple multiplication units would be a benefit for many applications. The traditional workaround to FPU being slower than ALU is to have two FPUs. Where there are an even number of register sets, half may have access to a particular FPU and all have access to one ALU. So, we could have a scale-out arrangement of eight register sets, eight integer addition units, eight left shift units, two integer multiplication units, two FPUs and one ALU. I do not suggest four or more integer multiplication units because a flash multiply unit requires a significant number of transistors and large bursts of energy.

Floating point multiplication requires addition of exponents, multiplication of mantissa bits and a round of post-processing. Floating point multiplication may be slower and less precise but it requires fewer transistors and is less energy intensive. On this basis, the number of integer multiplication units should not exceed the number of floating point multiplication units. If a floating point unit has two or more multiplication units (to exploit Euler's formula, exp(i z) = cos(z) + i sin(z), or otherwise accelerate calculations) then it may be sensible to correspondingly increase integer multiplication units.

With two integer multiplication units, two FPUs and one ALU, there would be five locks on shared resources. If there was no correlation between instructions or instruction streams (and all instructions had the same execution time) then eight hyper-threads incurring a lock on 1/3 of the instructions would incur 8/3 of lock contention over five locks. With local move, local addition and/or other uncontested instructions, a lock may be required for less than 1/5 of instructions. Total lock contention would be correspondingly reduced or permit additional hyper-threads. Unfortunately, this is optimistic. Hyper-threads are more likely than average to execute the same ALU operation or FPU operation. Perversely, this makes fine-grain resource locking less useful than average.

For reliable execution, all registers, execution units and datapaths should be doubled. If any sustained bit error emanates from a register set then an exception occurs. It is not recommended that reliable and unreliable hyper-threads are mixed within the same system. Reliable hyper-threads may be implemented in a form which wastes a binary tier of cache coherence and/or main memory interface. Reliable hyper-threads also require execution unit locks to be acquired in pairs. This significantly reduces utilization of execution units and, in the general case, invites deadlock scenarios. Furthermore, deployment of a reliable kernel and unreliable application greatly complicates interrupts, system calls and process threading. Specifically, a switch from user-space to kernel-space may require a neighboring hyper-thread to be suspended so that kernel code can be executed while maintaining a second copy of the registers. This may also cause deadlock. Specifically, where the number of hyper-threads varies, priority inversion may occur.

Fused Instructions

It may be useful to consider fused instructions. This is where two or more common instructions are combined into one operation. I shied away from fused instructions for many reasons:-

  • Fused instructions are contrary to RISC principles.
  • Fused instructions are not downward compatible.
  • A fused instruction may be a performance tweak which hinders further development.
  • The most common case, conditional move, appears to reduce ALU throughput - and this is already problematic.
  • Another common case, decrement and conditionally branch, is deprecated on x86 because it is problematic to implement and provides minimal gains.
  • Instructions, such as decrement and conditionally branch, may encourage side-effects when they should be discouraged.
  • Fused multiply and accumulate requires a datapath between two execution units.
  • Fused instructions may require more operands and this reduces code density. In principle, fused multiply should be A = B + C * D.
  • Fused instructions may increase energy consumption - even if they are not used.

Despite these concerns, it may be beneficial to investigate. An example of a fused instruction would be a conditional move. Moves are common. Conditional branches are common and conditional branches may be made around move instructions. When moves typically take one clock cycle and branches typically take two or more clock cycles, a conditional move has considerable advantage even if the throughput of datapaths is modulated.

PIC generalizes this to the extent of having an "ignore the next instruction" instruction (which can be daisy-chained). This is typically used to implement conditional branches but may also be used to implement conditional move and conditional arithmetic.

I was appalled that ARM implemented conditions as a 4 bit field on every instruction. There's a difference between theory and practice. For example, to obtain sensible code density, it is preferable to not have condition codes on all of the millions of system calls. Admittedly, this consistency has allowed implementation of Thumb mode. However, this has replaced one inefficiency with another. From disassembling Thumb code, I confirm that the 16 bit instruction format has a 4 bit conditional field and one value for this field ("never?") defines the 32 bit instruction format - but without the ability to execute it conditionally. Meanwhile, this 2:1 instruction size difference has inferior code density to Xtensa's 3:2 instruction size difference.

Move is historically implemented via an ALU where one of two inputs is faithfully propagated. On 3-address RISC architectures, move may be implemented as addition with a constant zero or addition with a dummy zero register. However, even if the datapath for a move operation conceptually goes around an ALU, it is wasteful to actually send it around an ALU. Furthermore, for a 3-address processor architecture, it may be more efficient for any 2-address operation (move, not, negate) to use a spare operand field as a conditional field. A 3 bit field allows implementation of eight conditions: "always", "never" and six inequality tests. "Never" may be used as an escape value or a privileged instruction, such as a system call. A 4 bit field allows a wider range of tests.

Multiply and accumulate is a common fused instruction for integer and floating point operations. Technically, it should be a 4-address instruction because it reads three operands and writes one operand. Thankfully, it is quite practical to implement within a 3-address processor architecture. Unfortunately, as currently proposed, the datapath for FPU MAC and integer MAC may or may not have symmetry. For floating point MAC, data conceptually flows from a multiplication unit to an addition unit, in one clock cycle, within an FPU. Whereas, for integer MAC, an integer addition unit may be local to a hyper-thread on the basis that integer addition may also be used for address calculation. However, efficient instruction decode may encourage symmetry between integer MAC and floating point MAC. Therefore, an integer addition unit located with a shared integer multiplication unit may provide the most symmetry with FPU functionality. It also avoids having eight or more addition units which must switch between two sets of inputs: either two source operands (for addition) or the multiplied output of the two source operands added with the destination operand (for multiply and accumulate). However, if an addition unit local to a register set does not have multiplexed inputs then the input to a destination register requires multiplexing from an additional remote output. This is a typical example where savings (transistors, latency) are not as great as initially expected. It also shows that permutations must be enumerated to find an optimal solution.

Pre-scaling one input is common on ARM. A 4 bit field allows a wide variety of even left shifts. (With a penalty of one instruction and one register, less common odd left shifts can also be obtained.) Where there is an asymmetry of addressing modes, it is most useful to have pre-scaling on the operand which has the widest range of addressing modes. This includes constants. Given that left shift can be applied everywhere, it is not required as a separate instruction. Likewise for addition.

Fibonacci Length Instructions

Consideration of opcode bit patterns should allow local units and shared units to be arranged in a manner which does not create tortuous decode logic. Ideally, I want 1 byte, 2 byte, 3 byte and/or 5 byte instructions which are BER encoded and where extension bits form part of the opcode. To reduce BER overhead where bits provide no gain to the opcode, I may use a Fibonacci sequence of 1 byte, 2 byte, 3 byte, 5 byte or 8 byte instructions. I could implement 1 byte, 2 byte, 4 byte or 8 byte instructions but this reduces instruction density and doesn't greatly simplify execution of an unaligned instruction stream.

Within the first byte of an instruction, could have 00sssddd for move, 01sssddd for 2-address integer addition and 1xxxxxxx for longer instructions. Furthermore, sss and ddd may represent even and odd registers. So, 00-000-000 represents MOV R0,R1 rather than a less useful MOV R0,R0. In this case, instruction density matches or exceeds Z80 or x86. It also provides implementation options for local units.
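
A hedged C sketch of decoding that first byte, with one possible reading of the even/odd register mapping:-

#include <stdio.h>
#include <stdint.h>

/* Decode the first byte: 00sssddd is move, 01sssddd is 2-address addition,
   1xxxxxxx begins a longer instruction. Here sss selects an even register
   and ddd an odd register, so byte 0x00 decodes as MOV R0,R1. */
void decode_first_byte(uint8_t op)
{
    if ((op & 0x80) == 0) {
        int src = ((op >> 3) & 7) * 2;
        int dst = ((op & 7) * 2) + 1;
        printf("%s R%d,R%d\n", (op & 0x40) ? "ADD" : "MOV", src, dst);
    } else {
        /* fetch further bytes of a longer instruction */
    }
}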

The next byte provides additional bits for opcode and operand. This may extend beyond register only and immediate mode. The byte after that provides a 3-address operand (or condition code) and a pre-scaler. There is no advantage to providing these fields in shorter instructions. Subsequent bytes may provide more addressing modes and immediate data. For the different instruction lengths, opcode fields may be regarded as separate instruction-spaces. Although it is not essential for there to be a correlation between opcodes presented by instructions of different length, it provides implementation options. It may be beneficial for even opcodes to specify integer operations and odd opcodes to specify floating point operations. It may also be beneficial to have a correlation between integer opcodes and floating point opcodes. For example, the following instructions may have contiguous opcodes:-

  • Integer move.
  • Floating point "move" casts int to float.
  • Integer move with accumulate.
  • Floating point move with accumulate.

And correspondingly:-

  • Integer multiply.
  • Floating point multiply.
  • Integer multiply with accumulate.
  • Floating point multiply with accumulate.

The size of the datapath may be split across multiple bytes of the instruction. In the trivial case, it may be beneficial to specify 8 bit or 16 bit operation within the first two bytes of an instruction. Potentially, a 16 bit processor would not have to decode an instruction which is longer than 16 bits. However, if instructions are not aligned on even byte boundaries, this could be stretched to a one bit field in the third byte. On 32 bit processors and 64 bit processors, an additional bit may appear in longer instructions. On implementations up to 1024 bit, where datapaths are assumed to be wide and/or fast, a third bit to specify data size may only occur in much larger instructions. A complication to this arrangement is that floating point values are typically 16 bit, 32 bit, 64 bit, 80 bit, 96 bit, 128 bit, 192 bit or larger. 80 bit is deprecated but other values may have to be shuffled around. Regardless, the default width is the size of the machine and therefore longer instructions may only be required to coax a machine into SIMD.

Historically, it was useful to arrange opcodes to maximize use of logic gates within ALU. For example, gates may be shared between addition and exclusive or. Nowadays, it may be more useful to arrange opcodes by the number of operands and the envisioned distribution of computational units.

Instructions As Binary Fractions

When moving from fixed length instructions to variable length instructions, I had the epiphany that instructions should not be considered as numbered patterns from 0 to 65535 (or suchlike) with optional extraneous data. Instead, instructions should be considered as a binary fraction from 0 to 1. Ideally, one or more patterns within the shortest length instruction should be privileged. To do otherwise creates difficulty when attempting to set an unlimited number of breakpoints. (Multiple breakpoint addresses? Reserved pattern? Debug mode? Trampolines?) Furthermore, it is beneficial to have certain instructions within contiguous ranges. For example, breakpoints, traps, fences and branches. In the trivial case, one contiguous range of instructions is sequential. This provides the most options when implementing a scale-out design.

Stack Width And Register Rotation

I would like to implement register rotation, as popularized by 3-address SPARC. Indeed, if stack operations are exclusively register width and there are separate stacks for data and program return addresses (possibly of different widths) then it is trivial to tie register rotation to the bottom bits of a data stack pointer. Unfortunately, this convenience hardware would impede super-scalar implementation. Indeed, MIPS and SPARC provide numerous examples of features which provide advantage in the trivial case but hinder more advanced implementation. An example from MIPS would be its deprecated branch delay slot.

If consistent instruction fields are used for memory and stack operations, and instruction fields define datapath width but stack operations are always the full width, then the unused datapath width field of a stack operation may be used to modify a data stack pointer by one of several values. Indeed, the available choices may be skewed for push and pop operations. This would eliminate tedium around stack frames in many cases.

Summary

Processor design is a typical technical subject where one design decision excludes another. Within general purpose processor design, decisions about register types and quantities lead to decisions about instruction format, which lead to decisions about instruction execution. This chain of decisions extends below Boolean logic and down to the length of wires on a silicon chip, the number of transistors switched and any stray alpha radiation. (That is not an exaggeration.) A design may look excellent but have a terrible implementation. For a given design, no good implementation may be possible. It is invariably possible to extend a bad architecture via escape patterns. However, fixing the first iteration provides the most gains. Unfortunately, this requires consideration over many tiers where nothing is immutable. This can infuriate; especially when a design veers towards an established design which is known to be flawed. Thankfully, if there is great foresight, it is possible to design something trivial which keeps the most useful options open.

I've found that wide registers allow amalgamation of integer, float and SIMD. I've found that designing synchronous hyper-threading is vastly easier than designing any form of super-scalar. This is particularly beneficial in conjunction with NUMA or virtual memory. I've also found that it is possible to apply pre-scale and/or accumulate to all relevant instructions with minimal penalty. It is also possible to fill spare fields with condition codes or stack modifiers. Oddly, this occurs in the most beneficial places. While some may be concerned that my largest proposal for a processor requires more than 16 kilobits of state for each of eight or more hyper-threads per core, the smallest implementation is much closer to an RCA1802 than a Sun Niagara, Intel Itanium or one of ARM's scale-out designs.

My Ideal Processor, Part 4

Posted by cafebabe on Tuesday December 19 2017, @09:17PM (#2870)
0 Comments
Hardware

(This is the 54th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

It is a terrible situation when a person's best vaporware falls behind a shipping product. In my case, my vaporware fell behind the Xtensa processor architecture and I've spent at least a week rectifying this situation. I hope this isn't entirely idle work but there is definite satisfaction when devising something which might squeak ahead of a competitive product.

However, my objectives for processor design may not be obvious so I list them here:-

  • Design a practical processor architecture which can be implemented. My previous design (with three source datapaths and two destination datapaths) does not scale to the extent of other architectures.

  • Design a reliable processor architecture. I'm not following the classic telephone company philosophy of ensuring that every component exceeds 99.999999% reliability (with the intention that a path through a system exceeds 99.9999% reliability). However, if there is an obvious or non-obvious method of greatly increasing security or reliability then it should be taken. An example is watchdog timer functionality. An embedded system with a watchdog timer may be significantly more reliable than a system without a watchdog timer.

  • Design a scalable processor architecture. As an example, the Alpha AXP processor architecture was intended to provide 1000 times the performance of a DEC VAX. 10 times performance was expected via scale of integration and associated increase in processor frequency. 10 times performance was expected via clean implementation to aid super-scalar design and suchlike. 10 times performance was expected via multiple processors sharing the same banks of memory. In another example, compiled code for a 16 bit stack-based Transputer was upwardly compatible with a 32 bit Transputer. Indeed, the Transputer processor architecture was very much like a Burroughs B6000 transported into the era of the 80286. Like its spiritual mini-computer predecessor, Transputer implementations are very interchangeable because code utilizes very few architectural quirks.

    It would be desirable to implement a processor architecture which works as a simple 16 bit embedded processor with minimal energy consumption and also works as a 64 bit (or more) architecture in clusters of 4096 nodes (or more). Clustering requires very good MTBF otherwise a system will crash or produce junk results very frequently. Likewise, if there are a significant number of nodes scattered like dust then reliability may also be problematic. In the worst case, sensors behind enemy lines or poured into concrete may be very difficult to replace. Even if it is easy to perform maintenance, is it cost effective to have a US$60000 per year technician who reboots lightbulbs?

  • Design a secure processor architecture. Popek and Goldberg virtualization requirements are not negotiable. AMD agrees but Intel thinks that your security is a matter of market segmentation. Overall, techniques are strongly encouraged if they reduce or eliminate buffer overflow, heap overflow, stack execution, gadget execution or transformation of malicious code into 7 bit ASCII, UTF-8 or UCS2. That may require multiple stacks, tagged memory and/or encrypted instruction streams. I'm particularly reluctant about abuse of the latter but Bruce Schneier (in the book Applied Cryptography?) states that it is better to have 10% of the resources of a trusted computer than 100% of the resources of an untrusted computer. Unfortunately, "It is difficult to get a man to understand something, when his salary depends on his not understanding it!" and many people don't understand that trust comes from the user and is for the benefit of the user.

  • Design a clean processor architecture. This encourages design of practical, reliable, scalable, secure processor architecture while also allowing upward compatibility. If there is a "cool" feature then it should be avoided because it may be an architectural dead-end.

    My preference is for a RISC processor architecture with flat memory, fixed length instructions, a 3-address machine, general purpose registers which exclude program stack and program counter, a dedicated zero register, no flag register, no shadow registers, separate program and data stacks, register rotation, optional FPU, no SIMD and no vector pipeline. The primary reason to deviate from these choices is to increase code density. This is particularly important to maintain competitiveness. Without competitive code density, instruction storage, caching and throughput are greatly reduced. There is little to be gained by saving a few transistors inside of an ALU if millions of additional transistors are required outside of an ALU. Other reasons to deviate from a clean processor architecture include leveraging existing design elements, greatly increased throughput, compatibility with legacy code and/or maintaining the expectations of a general purpose computer.

  • Design for real applications not benchmarks. If a design works well with a mixed workload then it should perform moderately or better with a parallel workload. However, if a design works well with a parallel workload then it may not perform well with a mixed workload. As an example, GPUs work well with parallel tasks, such as graphics, but are otherwise slower than general purpose computers. In general, it should be preferable to add more processor cores rather than add more ALUs, more FPUs or more vector units.

    A scale-out version of the previous design provided one hyper-thread and eight or more ALUs. However, utilization of ALUs is grossly inefficient. Across multiple architectures, approximately 1/2 of instructions executed are move operations and 1/6 of instructions are branch instructions. Therefore, in the general case, 1/3 of instructions are ALU or other instructions. A scale-out version of the next design provides eight hyper-threads, one ALU and supplemental units. If there is one active thread then performance is almost equal to the previous design. However, if there are eight or more active threads then the ALU approaches 100% utilization. Per transistor, this is more than eight times more efficient. In general, it is preferable to be bounded by computation rather than memory bandwidth because computational units may be more easily duplicated when memory bandwidth is not exhausted. (This also applies on a larger scale. Specifically, it is easier to scale application servers rather than database servers because application servers may not share any state.)

  • Implement old ideas. This is especially true if patents have expired.

  • Use the best elements from other architectures. When appropriate, I have mentioned processor architectures such as RCA1802, TMS9900, 6502, Z80, MC68000, x86, AVR, ARM, MIPS and Xtensa. To a lesser extent, I have also been influenced by IBM System 360, Data General Nova, Burroughs B6000, Transputer, XMos, PDP-11, VAX, Alpha AXP, PIC, Xerox Alto, TMS32020, MC88000, SPARC, PowerPC, Intel 432, Intel 860, Intel Itanium, MIL-STD-1750, FirePath, MMIX, NandToTetris, RISC-V and a mad balanced ternary design by the Russians.

    It is possible to find similarities between processor architectures. For example, there are similarities between instruction fields on Z80 and MC68000. Or similarities between the three 16 bit index registers on Z80 and AVR. Or the pattern of one data register, one address register on NandToTetris, 2 data registers, 2 address registers on a Data General Nova and eight data registers, eight address registers on MC68000. Or the hybrid of MC68000, x86 and RISC available in XMos.

    It is also possible to synthesize ideas. For example, a 3-address processor architecture with eight or more stacks may be feasible. Or consider an SIMD version of MC68000 where ADD.B performs four 8 bit additions rather than modifying the bottom 8 bits of a 32 bit register.

  • Don't guess. Use empirical research. I've comprehensively shown that it is practical to restrict subroutine calls, loops and forward branches to instruction cache-line boundaries. Indeed, even on architectures which do not enforce this restriction, it is beneficial to do so. Specifically, MySQL Server natively compiled with gcc on a Raspberry Pi with 64 byte boundaries leads to stored procedure execution times being reduced by 45%. An architecture which pre-scales addresses would obtain better performance (and possibly better security).

    It is tedious but possible to simulate queues, memory access patterns and other resource contention. Indeed, it is possible to simulate a hyper-threaded processor running real applications, such as a word processor or web browser. From this, it is possible to collect statistics, such as instruction length, opcode frequency, conditional branch frequency, register usage, addressing mode frequency, cache hits, ALU contention, FPU contention and memory contention. It would also be possible to collect more detailed statistics, such as determining the most frequently executed instruction before or after usage of a particular opcode, register or addressing mode. It is also possible to collect cycle accurate timings through particular sequences of code.

  • Avoid side-effects and conflation. This can be difficult because established practices become ingrained. A good example would be a flag register. This is state which lingers between instructions. Furthermore, modifications to flags occur as a side-effect of instruction execution. This may create difficulties for super-scalar implementation. It may also increase the duration of context switching. A further difficulty with flags can be found by comparing MC68000 against MC68010. The minor step revision correctly enforces privileges on half of the 16 bit flag register. This allows virtualization to work correctly.

    A practice discontinued after ARMv1 or so was the conflation of program counter and user flags. Furthermore, RISCOS continued the practice (from Acorn's 8 bit computers) where system calls returned information in the carry flag and therefore it was possible to make a system call and then conditionally branch. In the kernel, this required setting or clearing the bottom bit of element zero of the stack. Although this conflation eases context switching, it greatly reduces the address-space for executable code. This was amortized through the observation that aligned, 4 byte instructions don't require the program counter to hold anything useful in the bottom two bits. Unfortunately, this observation is incompatible with 2 byte Thumb instructions and 1 byte Jazelle instructions.

  • Avoid fragmentation. ARM, Java and Android are particularly bad in this regard - and there is a strong correlation between these architectures.

There may be further conflicting constraints.

A New Internet

Posted by turgid on Thursday December 14 2017, @09:18PM (#2856)
21 Comments
Digital Liberty

The USA apparently just decided to abolish "Net Neutrality", making the public Internet beholden to large, established corporations. This is bad news for individuals and small businesses.

What we need is a new internet, a grass-roots one, ad-hoc, created by volunteers.

Many years ago when WiFi was new, there was one such attempt I seem to remember called "Consume the Net." I never had the money to buy the hardware at the time, but it sounded like a great idea. The problem in those days was that getting any sort of broadband connection was difficult and expensive. You could get a 56kbps POTS modem, sometimes ISDN (64k * 2) or cable (500-600kbps) if you were very lucky, and ADSL was just coming out. WiFi was already running at megabits.

Now we have a different set of problems to work around, but the technology is ubiquitous, cheap and mature.

It would be cool to have the equivalent of open access points on this new co-operative internet that you could scan for and join if you promised to behave.

Any ideas?

Harry Potter by Algorithm

Posted by turgid on Wednesday December 13 2017, @09:36PM (#2852)
5 Comments
Software

The Guardian has a story about some Harry Potter stories written by Botnik's predictive text keyboard (complete with a link to GitHub).

        “He saw Harry and immediately began to eat Hermione’s family. Ron’s Ron shirt was just as bad as Ron himself.

        ‘If you two can’t clump happily, I’m going to get aggressive,’ confessed the reasonable Hermione.”

So, not much worse than the original.

Origami Cube Bitmaps For Christmas Decorations And Suchlike

Posted by cafebabe on Tuesday December 12 2017, @08:42PM (#2847)
2 Comments
Code

For Christmas 2016, I released a rotating torus and fractal animation. Admittedly, this was test data for a video codec but it was also fun and frivolous software. For Christmas 2017, I've automated the algorithm for converting 1, 2, 3 or 6 bitmaps into a bitmap which can be folded into an origami cube with the pictures on all sides. This has a variety of uses including corporate branding; like an unusual business card. It is also great for making unique Christmas decorations. Indeed, I've now got some really strange Christmas decorations including cubes of:-

  • A six sided dice.
  • James Bond actors.
  • Clover, Sam and Alex from Totally Spies.
  • Pinky Pie from My Little Pony: Friendship Is Magic.
  • Bob Dobbs, the Cheshire Cat from Disney's version of Alice In Wonderland and a hexagonal toroid from Mathematica.
  • The Ford Escort RS2000 from Fast And Furious 6.
  • Characters from Wingi's BunniSpace.
  • Goatse.
  • Tubgirl.

The program uses the current version of the generic Makefile for C programs using libgd. This works with a local copy of libgd-2.2.2 and libgd-2.2.5. Like previous programs, it also requires a local copy of libpng and zlib. On Linux, via bash, invocation is as follows:-

export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH
./origami-cube 1.png 2.png 3.png 4.png 5.png 6.png output.png

If you don't have six images, it also works as follows:-

./origami-cube 1.png 1.png 1.png 1.png 1.png 1.png output.png
./origami-cube 1.png 1.png 1.png 2.png 2.png 2.png output.png
./origami-cube 1.png 2.png 3.png 3.png 2.png 1.png output.png

Further documentation is inside the archive:-

begin 644 origami-cube-dist-20171211-204606.tar.gz
M'XL("([N+EH"`V]R:6=A;6DM8W5B92UD:7-T+3(P,3<Q,C$Q+3(P-#8P-BYT
M87(`[1MI<]NV,E_+7X$F<2TYHL1#E&RGRHSC'/74.<9QYM6O3A.*A&2.*5+E
M(<MYROOM;Q<`Q4.D#CO-)/.,3$T<NXO=Q6*Q"ZA^X`S-D2-;<9_*MA-&LJ:H
M75535:BT.TJG=?3ZW>G!\?&]FQ<%2J?3QJ_:-93LEQ55[=Y3=4735<W0$4[5
MNYW./:+<^P8E#B,S(.3>_VF)+IR0A/X@NC(#2L(+/W9MTJ<DB#TR"/P1B2XH
ML>(@H%Y$;">@5N0'UR0PH3^`0=,CC@<J=%UJ-R7IWEWYL8J_<O^_,B_IP''I
M+?=_NVK_JXK&]K^FZH;1-A!.;;>5]MW^_Q;E02L.@U;?\5K4FY`1K+4D/2!#
MZM'`L4BR]F3@!V3DVS%4`_IW[`2.-R2NTQ_:`%T[K(/1=.0N.05G<>A[H1]$
M3CQJ2@]@%(?0G@@J.N(=8&"*3D)S0N=M!,"V].K@]^<]QLBSHW>GO9/G!\]>
M/2?B%$HYVFE:THLW;WIT.H;9R-O?7WX\?//ZQ='+'I-F8+HA);_\0L3XX8OC
M@Y?O>MORT</:VW\]J[?ZL>/:+<>SW-BF1#[.=8-D1'ZCDX<UCE??SI`Z?I;0
M6D1Z6!.C!8R/QT=/3PY.SCZ^/3C]K;==Q-M'O!Q,?5N2P*GND\#W(TERO(@.
M`R>Z_N@H^])/`\>SB1Q=CV%EB!PZGREYI!!Y#,L2*61&IF8P#(FL@$,WPWA$
MG@!CKP]>/:__5_K),J-Y"T!QK8A\2;042/HI&*4(Z=PPL1`(AWKS?OF3;4;`
MP=;9UFC+EK=^VWJU]>Y3,YI&J(.'-5S1.LF*`+(%UD='!8HURR;-)@*.+N%\
M(:";9T<G=?EA[=W[%TK]8>WT"-C8!J:L,4I(/KD@F4T0[)U2)]H3TK+II.7%
MKON);#>;K7("V9EPQ\NR%X_0R&7_"HR=R!:1!Q6S-P%ANXJSUDZ=S&;DO#B!
M=4MJS(+94LC!8*E@P\\.J&:/\5\%QB?EE/L`KVV",/V\"704Q%0LL)*:#"#U
M[G\:7]EH<]0FV^&TN=.:3K?3-C]__E3DO0_LS\Z?BOHA:>J\AL/:O-.8USJ\
MEJ=W6U*?[F<V,<H)$I29^OVLG7/+9@I(I4==]3+$T'9[VSO;"XBP-5`+!4RF
MF`5TM@5.ZZ5$+)>:'E"1A?TP5\.;G\'AI`/8DA,=R7M-+CP?AZ&Q-TQA>7L)
M]-#.`0_M*ECT:DT_K4L2G4:!:47<9CYS)W5P<IC?X+"VN+6FX/0*_<SRAC?$
MZ]\0;WI#/+ZU$Y'3M08ZO;E[>_7F&;C9.JN\/7E>YYK$C=8<?EXROG00D/N?
MM6J`_BKLZ9+Q*2`O:,%QB:Q^RIIHNM*2A-:'AX!0`"?<P]Z,M?-9>G()$3@6
M:@]K$`?4<1#\KU!?N547&&JV+-\;.,,8,A\9CA8(+*:][-%\*^H)IVPG?@U*
M7X6;B(;15R$DLCZ@Y7H0@JQ/@[D?B;N2?>&,"NO/!V]G`56^:E,;@-XK)[J0
MF6`;V,B:\Z]E)9O1^DH<+;.4#4E5V,HZ5,01)+'39']^(BT:S-"^M;V4'5>W
M,9?2`6!_7?M9@Y]US6<-4DFN<O3T[6O(I3;.ER"%R:+F$R`^`I_*M"F%R`4T
M7T<]*VQY`TK5IKR*"`^/)(QV]GF#1T%6GB](0RI4;@EX(OLB?JHOH"ZJED,2
MV851%R(S%V)Y%[VK[`('@E0=^(H]D6QF:4;.B(+1H_Z:X<4\T4D!EHT)9#9!
M&69V0(1$=Q=RW]W]'[^`N<T<*^[_U:ZNB/M__*?A_7]',>[N_[[)_=_\KD]<
MM*UST[?./9^$A/*O"Y%/KOS@DI@ANPSR;&KORY+D$_+R]7MV]YAY@("S-<0W
M!].SB<6>%_CMI(`&9[<4UK(0TJ8#,W8C`*60CJ0(?3.\P'%\W!CXKNM?H926
M/QH!B7"1,K!)"`$$-Q05D%'46`X(LK+D4_2-KVQ1"VE2PXL#G#/TX\"BF*A?
M.!,:,H5CM-`0P0V3@I\/%1+Z`3[#-"5I[)I`B46F?S3/FO].DD/'6WRR:4I0
M`0&9VLF$!J'C>P26AZ&K3:VY*]`3NB(\NSEE04!M=II:9Y$X')BWH@WX6A/_
M"<IIESXG&9+8<YU+ZEZC]0&E,=X=A[%ET3`<0(X*,SA`F]H-T.WU?`9?W%_P
MJ1+\`9[E>-Z!`0]@X8#3,!ZS"`?@!JYY">2D="*T[@?B4AUZZ91:<93IQ5<V
MO%]E(849"=$\?TX6W]3F^RB@9NA[P"=)8]#0"IQQE&"-`W_BV!L]Q*WV_SSH
MN+7_KWS_T;1V-WG_Z715A;W_J(IVY_^_16GMD/>1XSK1-3GUR2'86$3)4R<:
MF6/R`BSO#;</<@CV079:$L#S0V#AL0='<3A]WL%S0.#DWG@8Y(,DKOTUO`Y;
M`!LU+YX4>O&!(\QW#VULXYL$[*!I#;^>TF`?M4[^`[[6&=0\Y4G2(K!MHCCP
MH*_^&-I?I+1'A9XOC&?8]M>PST<.[#@(@B.^%_GY!SW@F?84.$R&`:70L`(*
M#@F<`,#TN:HN/?_*0Q<!6@K_CN&T:PK)1^8U[DL[!J=W`9[;Z;.3-81CB7D1
M00"V-&0*7D@@2?7",1#PK&NNTXGO"*9H;6@?C<PAW2'.2&ED&BI30:`(%4#=
M51JN^E@T+"6IA5.LA=.>0'[W1PVPF6I"R$^MBUI"A,!1&E*B[(L6P9.JYBH]
MY;&K_`IT7.71H_I\4`RK.*RR834_#/249-:7-'KK3*F+<S<8IXR#I"2\I5`*
MAVI82@;PBY3_]L%Z+_GPEU0`]3L1()S*JNPRZ!M(H7U74J32W$`4_7NQ*'4N
MS692B*@R%:,`D?J8Q+^,:#"D!/R"#;L;'0=&#W$`&Q>8"LG5A6-=$`R1LWX$
M@]/0A!0:WW<3=R)"P@BC!T8"(,8NT,#($'QK``2%>R']&/[S8\\VP;O1(/`#
MF,J!0!A<&&":Z$&=43Q"=DQR8;H#,D;M<-B,\V'L+_$]F8:6;>C91OLVOFF)
MH2PU$C@,7.6)JV;-!OI@W1>ZU[:F+X3B;PM68FJ+F%(%A<U8:M^8);V:)6EM
MUUMBXOPX=CQV'IO!T&I8%Q!5[>Q`?<(E$H3)#E!\G&^KA;96:.N%=KO0-@KM
M3J'=+;1W"^V](C\9BVR$UX\E'E6@7+W>_'0<L%];P,)%-NR7QGU(,<<_$V=[
MA%&]A_DD^_V<Z9&#$<10Q($C_N=S[[[0(ITZ44V=QR2"_L^]W2KZ6^$^B4-@
M<9]L80(]CB.UB?DBJVII54^K[;1JI-4.J_IQ!'6L`E,-7*@_E0_)-\NDEC(Y
M4A.CXI'B"Y#P!20Z-8:F<C18P&506@*E+X/2$ZCV,JAV`F4L@S(2J,XRJ(Z`
M&M1PU_=ZK]\?']=G,]S(V8:>;;2S#2/;Z(C&LM5D;G8?0C_TZ)B_`5?@Q-%!
MLW.AF5F8[(+H8D'@#YP([+=`+H!!7@J),T2FF,$R[\V<*8;)>8?:*'9IBUWZ
M8E=[L<NH-[+-3IT5YL6O<Q.?+4Y\MCCQV>+$9XL3G^4G/BM,S"5F.U<LNY)?
M]M,@IH>^"\=&.-W1&OA'`';7!=Q=%W!O74!U&9.["+E;?RQ6/!Z'E@E9!+,4
MW-"A6.V$@C^^/J$\,+"9!\=C3!'_YC,W"E:17ZMZQBD^`],*_.OY:;S@!ZKD
MRC#$CE+D98&/(F21=6T%ZUJ>=:V"=:W<.:W'NG8SUO45K.MYUO4*UO5RC[D>
MZ_K-6&^O8+V=9[U=P7J[W(VOQWK[9JP;*U@O.`^C@G6C_&Q9CW7C9JQW5K#>
M6?![I:QWR@^\]5CO5++.W0]>;OJA$U'A?(CO0<+"PPEQG2'\$0"/S2BB@4>N
MS!#R)JB.'(_:A([&3N#@7?TUZ5_C';R-$=/8'-.`7X";?>JZV#=Q0G9G,C`M
M.O=T>.[Y`:&83T+V$H>T(9!Q(GY98B/E^;U-B/F4-Y\I/P=#HF/4D9W.P&Y=
M+::YD`#?,"'>O+J.!4*;$5[$>Q1R*LRS`-+'_R7BR@&.V'42@$ZHY^!U-F,5
M)@0V@XP$-@W9]5*B(Z!AAN(](MR76?[%`!_)O"1?></VH_F4,W).-/(7?+'\
MBH.L-DOR@;](IIT=GP%>"_ZFE)#6;(ZI'V<Q2>?I+-<F[?GXA-')4IH0E='*
MEUFALCC>`KQ)AM+7U%,RQWE&+ZT2'LXSK+4R>(N4",)JIS.$(Z72POA,>YH?
MSU+BZS%C@PC;(J)17$O@]GS"Z/Q%,G@E/,W:)X3\@<LW*^5IANN&XR<S4B$=
M<@Q?,=\YJR606H82C`N^^;AV6J6G%A+EL*5Z0KD1^YR0I1IO90SGO-1^".,V
M/_[/VE.%C:]HS_X92K/RO;[*%R1X64J=X\HYU>H=#'@G!4J3G-=8X0OTM'OR
M#^KI*U'Z`?R3L<H_&4O\$Z+/_0_W3W^ED$;.%PC_)(S)>$JJ_%.)_R$K_%>6
M$NKM2>)_N'^:0_*QU'Z$?V+^*\&K\$_&*O]D_&#^:0;;M'4#7W#.MG?^-&_-
M(PQ<N'Q<<%J,"^;C2&N2H]0"FYG=("Z8`-[YU]83)Q7ZKF.3D%KX0EB2=V.:
MR_-MB)EU_F<Q`Q"O>4E^JY<ERTJ#O[?LR(R6NCR1R&'HFR)TJQ!RC.HK&%4%
MM^O/:U0I:.D4W56,ME<P:FS,J+HIH\9:C!IK+;VQZ=(;FR[]F@B=^0SK*B_%
M6*F+3D-;0Q?ZIJ+I%5FL[;"?RA2W\IRE+MN9!3YWV2XH=.XQB^.=_*4LV=<<
M`;][JZVK3*-9<MT&O\O!;WNU[2W=S5UF>XMS&&(.8^4<E8999'E7?/=6LVQ4
M7CBP/+MLA=2R%=+*5JA,W*Y8'<;G&L8G;^Q3Y756U5A+17*E[UO*0;="J;G[
M&O;C-L>E^?.,O8TP@NP1HON!(R_<-BG+[HHK[F$K[C@K[@\K[N8J[KT6NKOE
MW;OEW7L5XG`QQ8^'%/[;H;L?D-^5NW)7[LI=N2MWY:[<E1^J_`\]G.O'`%``
!````
`
end

Debug test data which should be saved as test1.png and suchlike:-

begin 644 test0-0-0.xcf.gz
M'XL(""JI(UH"`W1E<W0P+3`M,"YX8V8`[=UM2%-1`,;QLVEJIK7,1$+BHE$S
MG#@SB9(0&Q$2%:$1$=G<FZN]R%Y$(]*0%A%1$C$B(B)"(B(B1"(B(B)"(B0B
M0B(B)"(BHB0BPNY==^S>;=^>+Z,]%^YV=L[^4[Q^^G&V>;S^?FG0X9;<7I]+
MR$>[>BK'4ODT&.2;"OFL5!:4!T*Y:3<J(^5FF7R6=)P30CG5YQ4J<_)9I8Z5
M8KE\URK?EWOD'VEQ!/U^5R"2?+G$ZYH2*UZ_W>.R>$)>I[IH$.9P9,CGDL)!
MG]=97VIV>QQ!7S`DF1-WEI"GURXU-38ECBP#JSJHE]/>+*DU_8F9`R4=#/?;
M'=Z`1[(F7UF>',HVJ<Y9H@%O1/(&''VNL-('W>ZP*R)IZ\RI?S/Z5*0.P[00
MM1-";!D3HBL@1+13B-@:(>*%FB<E+Z'R1S6V)J^3<A2IUV->O3[):[-0'2M'
MC69<JADOTHS+-.-RS7B)^KK)HUBSIOV_4`_C:-IO+*\;X_HYX[CR8-@DQ(A)
MS/_Y_>OGW/=O7[]\_O1Q]L/[=V]GWKQ^]7+ZQ?.I9T^?/'[T\,']>Y,3=^_<
MOG7SQOCU:U>O7+YT,7[A_-C9,Z=/G3PQ>AS,1\#\&)@?!?,C8#X(Y@-@'@'S
M$)CW@WD`S'U@?@C,^\#<#>9.,.\%\X-@?@#,]X/Y/C#?"^9[P+P+S'>#^2XP
MWP'FV\&\$\RW@?E6,+>!>0>8MX/Y9C!O`_.-8+X!S%O!O`7,F\&\"<P;P;P!
MS->"N1G,5X/Y*C"O!7,)S%>">0V8KP#S:C"O`O-*,*\`<Q.8+P;S,C`O!?,2
M,"\"\T(2``F`!$`"(`&0`$@`)``2``F`!$`"(`&0`$@`>4$`PZ;$[@7='H?U
MN;*WH3BU7->3N;>A+JR?JTOL?R!J$#6(&D0-H@91@ZA!U"!J$#6(&D0-H@91
M@ZB1+ZA!!B`#D`'(`&0`,@`9@`Q`!B`#D`'(`&0`,@`9('_W-K3DRMZ&HM2R
M;5/FW@;;3OV<K2?YN0W$#>(&<8.X0=P@;A`WB!O$#>(&<8.X0=P@;A`W\AHW
MUN4*;BQ(+7=79^)&=X-^KKN-;]P@:A`UB!I$#:(&48.H0=0@:A`UB!I$#:(&
M48,?2$D"(`&0`$@`)``2``F`!$`"(`&0`$@`)``2``G@?_OLAFQ[&YIS96^#
MYJM!HS\R]S8,E.CG!JKYA9O$#>(&<8.X0=P@;A`WB!O$#>(&<8.X0=P@;A`W
MA#!:<P4W"E++L>E,W(C-ZN=B<WSC!F&#L$'8(&P0-@@;A`W"!F&#L$'8(&P0
M-@@;^08;Z;B1H(D.N^.P)Q2,!IRYHAS&U')\,DTY"N2Y*?U<?$;=PC&?//\"
(H`"7G.>7````
`
end

Example data for a dice:-

begin 644 dice1-0-0.xcf.gz
M'XL("/JV+EH"`V1I8V4Q+3`M,"YX8V8`[9L+<%Q5&<>_I)M'LVW2-&G:;9)F
MP^;=IDEWT\=ND\M#8$3!`15&$!Q*'R'2%WTXK8(MCJ#.@.`HB`\4'^`+$!^@
M*)2B@D4$101!`04%!A$0%12Q/?Z^N_<D)YL-;;-=9JGNS"][N_?N/><[W^.<
M_Y[;P:$UZZ-;EJ^*KAI:O5)$BM^11HJ@6M_U8#KT<Z*WF`\NX8_LTC\=Q?I9
MD7ZF5W7HGUWZSW/T[,-Z\3GZV</ZIP;*WW"9B,*K%D)0"77!L5XW@[<&WJ<-
MTK7NH37+!E=V#VX86B'I\T72L7'3UM4KHQO7K1Y:T5G1L6IP^;K5ZS9$._RW
M[@V#9RR+]L[O]5]9#A8$!YU\]8PL7UV0>>'8`_WJEHWKERT?6CL876#OS(=;
MLWT8?-:]>>W0INC0VN5GKMRHWU^W:M7&E9NB[K?'?I3^9/179>15]!M\=+A(
M]!J1WFDBJ:M'S@T[<A*4K%ZY:I/_:7"V-!APHTYQ!G]R<*RO!N>XPCD..\=3
MG..ISG%5<%_[*G/.6<=/<CI[84:G-;"NQS@BL+H1.F'AZ&M"Q\))\#WLBT('
M],(2.`S>",<#UY:>"D10Z0ZZLAZVP';X,%P"E\.5='D-7,<P'`TW`M=/O@/N
M@0?@47@2GF-("-J*`3@*CH,3X30@C2JX3\5F.!?.AXO@4K@"KH)KX0:X!6Z'
MN^%^>`2>@&?A1=C-<)<`8QLF,<*,1;@5NH'Q"--^F/;#S^,*$FS*K?!3^`40
M'%-^#T^!GO\G+GJ%?&/<J_!'%3ZJF@E-T`X]L!@.A6?(/\Y/NPEN@SOA7G@(
M'H.GX05X&;\0-]4:-[*-$-Q614@9_ZBB1Z:9#JDS,:DW49EC&J79S)86,U/:
MS0SI-#4RUU1+MZF2'E,IO6:*Q$U8$F:R+#3ELLB4RA)3(DD3DI29)$M-L?2;
M/=Q<=@0-B.PQ,B"E9HF$34(J3:]4FVZI-5TRT[1+Q+32<DP:3;,TF2:)FCER
MB&F0F*F75A.1-C-+.DP=/9DA7::6WM3(/#-=YIMI]*B*'E6F;=H?2@,X'I`B
MTR_%/DMEDDE)R"2EA-Z6FL52YK-(RLU"F>S3)Q58$?:)RQ2S0*;Z]/J659H>
MJ?*9S[@JW;ZU:>;)]&'F2LTP7?YHI.F4&6-0_V2CW1_!W`C&Q']?8*;BW2D^
M";S<9RKP\F2\7"Z+39G/$CR>]#U>@L=#>'R23S^>'S!%/IY&P(`?!A4FW4+)
M]G08^'^*S`C%PZ/?S^@O'2;D>R)-B>^1T93Z'AI+V;#7)LY$XDC'<CP_9?.I
MZW,W%MP8<6/'QI.-+QMO-OXT%FU<:HS:>-78M9;I".GHZ:CJ*-N15T]XZIIM
MDVY-&S9E>]J%FLWJ7\UN];EFNV:]9K]6`8T5K0I:';1*:+7P;U!VZ'`,]`=1
MH1%BHT4CQT:11I2-+HTTC3B-/(U`&XT:F9KIFO&:^5H!M!)H1=#*H!5"*X56
M#*T<6D&TDFA%T<JB%4;[I>,P7N1DBS(W"MWH=*/6C68;X]N8E"8211:MO%J!
MM1)K1=;*K!5:HTFC8%ME^CJMGIK'&D<:)QH3&@-JCW:CTLNI.!X0"NE^)7MP
MD[I0W6M+NRWIMHS;LFW+M*:H+;=M,HLACY@6F>VC3CE$&@BQ1M])37[(-1%Z
M4=]Q2@/A6.^'9<QW:,0/TU;"M<U'':W4^6'<D4Z@T$QZT,Y=&SA32VI5DA#E
M)$PQ<^KVH*CN](-.`U*#58.['(LJL$B#0`N#!H06#@T.+2I:9#18M#!IX,S"
M(@VBV8%%FB@-6*1),P>+F@*+-)F:`XOL]*RT8%%K8%%;8%%[8%%'8%$N'IOA
MIW>:6C_=T]3XZ9]FNE\.TF@1RF2:7S+&4N67D]P9[_[9^N+VU;7!M<VUV9V>
M.YTS7<XWYCIWFN>TT)VE!_/'Z6W/`1J-\>Z?K2]N7UT;7-M<FW.)(YM;-M=L
M[ME<U+RT.:KY:G-7\]CFM.:WYKGFN^:]K0%:#[0NV"6`G=+M%*[UQ$[5=GJV
MT[+6(:U'.GUH/J?7S-O)\B39GB#KNQF%=OK4R/VKC=8%O<;FELTUFWLV%S4O
M;8YJOMK<M9.BHOFM>:[YKGEO:X#6`ZT+6A^T3FB]T+JA]4/KB-83K2M:7[3.
M:+W1NJ/U1^N0UB.U9L^(QPX4Q8?YH(/GI?%?EP7HZ\T!^FH-R%3:96N&5JSP
M?TDI"(T=&CE=_=>Q&GLZUT7/`71P]!-PY>AK:M#@-=^'G;`+7N+6";@8/@F?
MAZ_`-P$=7HO^K$5_SK@+[B,UT*!U:-`Z-&@=&K0.#5K'>,RDKS/I_TR.9S$&
M$?1[!/T>0;]'T.\1_!!!OT?0[Q'Z$3D=Z&/DB_!U^#;\`'X$/X-?P6_A<?@S
M_`W^+3(;&V<SKK.GPVPX!+H@#BDX`MX$;X53X`QX-VR`]XK4,[[UZ.L&_-:`
MWQJJ(0+-T`D+("G2B`YOY+HYQ\`)<#(L@R$X&[;">2)-?+^)MJ+X-*ICC^W1
M%B#<HGW0#T?"L?!V.!56P&K8-*RQQV/JCD!Q9BI-=RFBJ_ET*8B-6@9T^:53
M5]Q6@^NZ6=?1KO8NWA-\>:>_0-6<U/RT:P(5$)K'F6L!NP[(G/]=2:YU)9>5
M6:94SEQKV?66*WVM5+5KK\SUEUV#V768)>I;,&?4NLRNS2R-PY6Q>=1Z+1OU
MPQ5UXKCS>>::*?-G#CN?NW.WG6=5#*DHRI3L5K9;Z6Y9Y,NV\E%2WLIYBQ5D
M+BE?^HW%"KE<R"6.[#R=C6P^=7WNQH(;(V[LV'BR\>7.]>Y\;^=\5]:[4M[.
M_YF2W<IU=RW@RG,KR[<5;4_;7+ESM")7]Z@+]:15U-G(YE37Z6XPN$'B!H\-
M*!M@KBIWE;E=`+H+/'=19Y5ZYH+%+E9<U>XN3NRB)'-!HD5,B]F!D(KV%PVM
MK=7^HJO+7UH-+[PJP_YU=M&C!54+:S%G=A_<8KIL7V>L?!?WM#">R(Q8'LR(
MH?S,B/FNY+FXK]!GQ'Q7<G?&+?09,=^%/)<X*I@9L2R8$<O'F1'W6LE#Z4I>
MZ+-:EIDFM`\S36%)[=(-0X-G;BH4I5TR<KKYE+%*N_G](CU_@+_`OT1Z0Z.O
MB:%.8ZC3&.HTACJ-H4YCJ-,8ZC2&,H^APEOX3@M];*'-%NQI03FV,'HM*,<6
ME&,K0]J*:FY#/;:A'MM0CVVHQS;48QL#U8;2;_N@2#M*NOT2D0[NV_%9^#)<
M`]^%F^$G\'-X$/XCTLGX=E9"'<R!-I@/B\"#H^$M<!*\"U;!6G@/8'?G!?!1
MP+N=GX.KX3J1+I1T%RI[+@IZ+MZ>^S9X)RR'LV`CO$]DWC;X$%PLTOUQ^`Q\
M";X!WX$?PH]%YO?"I8SO\7`ZG`GK80ML!^SNP>Z>R^%*^"I<#XQMST[8!;^$
M!_>JM,<C?"`V`=("^U5^I`\%/]*'#\R/]+FL_/*]85'H/]*[2X!\;W84^H_T
MN<11OC<["OU'>C>.\KW94>@_\.<21_XRKBK8[)!7V>RH3&]VZ-5[7E]:>K+1
M_??^/#]<E-:T^7J$JS1XA"OT&CS"E8='CR1X]*C*??1H(LXL]$>X\OWHD5OV
M['NA/\*5[P=])A)'K\TC7).#1[A*<WB$:Z^/'X4"ZXH*[S&OO3T2%4H_$C7\
MI%;5!)[4*BR-'5JW8D6A*.S2D=.]-X]5V+VHZQ2*,_4!0&VF/CWZFCCJ-X[Z
MC5\(*-\XRC>.\HVC?.,HWSCWC*-\XRC?^*_A8?@3H-CC_P"4<()^)+`A09\2
MV)M`@2<8VP0*/($"3QP)J.\$ZCN!^DXP>`G4=P+UG:#]!.TG'A?I:P+4:Q_J
MM0_UVH=Z[;L+[H/?P1]%%CX%SXLL^CN\(K(8?RQF+!?70#W$8"X\([*$/B11
MYTG4>1)UGD2=)U'G2=1Y$G6>1)TG4>=)U'D2=9Y$G2<9JR3J/(DZ3S)>2>(B
MB3I/HLZ3J//DC;`#[H![X`%X%)Z$Y^`EV,,XXY=4)=!^BO93M)^B_13MIV@_
M1?LIVD_1?HKV4[2?6CMAA7VP[H14Y&LG)+W/,I&=C9)@9Z.L,'8V<AG>@VVO
M?W]W-MSUU<&VU[^_.QNYQ-'K=J\_%.QLA">ZLU&4WMDXV/;WL^R$%.W^WWR`
M_?6Q*M7_P1@^8MGRLP8WK-N\MF`6I\4CIY>>G-%E^K+T7!'O:W`#W`9WC[YF
M(`*-T`R8/-`)W;``%D(2!N!P.`J.@>/@!#@1:'/@-%@&*V$(UL#9L!FV`GT8
M.`_.AX_`1?`QN!0^!5?`%^`JH*\#U\*W@#X/W`2W`'T?N!WN!&P8N!?NAX?@
M$7@,GH"GX5EX`5Z$EV$W=C,67@G@&X\Q]JIA!F"_A_T>]GO8[V&_A_T>]GO8
M[V&_A_T>]GO8[V&_A_T>]GO8[V&_A_T>]GO8[V&_A_T>]GO8[V&_IS[`?@_[
M/>SWL-_#?@_[/>SWL-_#?N^J8'%J]DKI/ESS_^M>_;I]H?@PL[\EX[_-=(5D
$V3X`````
`
end

Bug fixes and further examples may follow. Bug fixes are the most likely, because there are multiple techniques for folding an origami cube and I don't think I'm using the most common one.

(The usual uudecode instructions apply: save a block to a file, uudecode it, then gunzip the result.)
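For anyone unfamiliar with the process, here is a minimal sketch in Python which performs the same steps as the classic shell pair (uudecode followed by gunzip). It assumes the dice block above, from its "begin 644" line through its "end" line, has been saved to a local file named dice1.uu; that filename is my choice rather than anything from the original post:-

    # Decode one of the uuencoded blocks above, then gunzip it to recover
    # the GIMP .xcf file named on its "begin 644" line.
    import gzip
    import io
    import uu  # standard-library uudecode (deprecated in recent Python releases)

    with open("dice1.uu", "rb") as encoded:
        decoded = io.BytesIO()
        uu.decode(encoded, decoded)   # strips the begin/end framing, decodes the body
        xcf = gzip.decompress(decoded.getvalue())  # the decoded payload is gzipped

    # The output name follows the block's "begin 644 dice1-0-0.xcf.gz" line.
    with open("dice1-0-0.xcf", "wb") as out:
        out.write(xcf)

The resulting .xcf file should open directly in GIMP.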

"I study liars. I’ve never seen one like President Trump."

Posted by DeathMonkey on Friday December 08 2017, @10:41PM (#2837)
22 Comments
News

I spent the first two decades of my career as a social scientist studying liars and their lies. I thought I had developed a sense of what to expect from them. Then along came President Trump. His lies are both more frequent and more malicious than ordinary people’s.

...

The college students in our research told an average of two lies a day, and the community members told one. (A more recent study of the lies 1,000 U.S. adults told in the previous 24 hours found that people told an average of 1.65 lies per day; the authors noted that 60 percent of the participants said they told no lies at all, while the top 5 percent of liars told nearly half of all the falsehoods in the study.) The most prolific liar among the students told an average of 6.6 lies a day. The biggest liar in the community sample told 4.3 lies in an average day.

In Trump’s first 298 days in office, however, he made 1,628 false or misleading claims or flip-flops, by The Post’s tally. That’s about six per day, far higher than the average rate in our studies. And of course, reporters have access to only a subset of Trump’s false statements — the ones he makes publicly — so unless he never stretches the truth in private, his actual rate of lying is almost certainly higher.

I study liars. I’ve never seen one like President Trump.

A $300 Bil. Mistake in the Handwritten Tax Bill, shocking!

Posted by DeathMonkey on Thursday December 07 2017, @02:55AM (#2829)
4 Comments
News

Senate Republicans Made a $289 Billion Mistake in the Handwritten Tax Bill They Passed at 2 a.m. Go Figure.

It appears that Senate Republicans managed to make a $289 billion or so mistake while furiously hand-scribbling edits onto the tax bill they passed in the wee hours of Saturday morning. The problem involves the corporate alternative minimum tax, which the GOP initially planned to repeal, but tossed back into their stew at the last second in order to raise some desperately needed revenue. The AMT is basically a parallel tax code meant to prevent companies from zeroing out their IRS bills. It doesn’t allow businesses to take as many tax breaks but, in theory, is also supposed to have a lower rate.

Except not under the Senate bill. When Mitch McConnell & co. revived the AMT, they absentmindedly left it at its current rate of 20 percent, the same as the new, lower rate of the corporate income tax that the bill included. As a result, many companies won’t be able to use tax breaks that were supposed to be preserved in the legislation, including the extremely popular credit for research and development costs. Corporate accountants started freaking out about this over the weekend, but the situation reached high farce when a group of lawyers from Davis Polk pointed out that, by leaving the AMT intact, Republicans had essentially undermined their bill’s most important changes to the international tax code.

Jumped sharks sector surging in Venezuela

Posted by khallow on Monday December 04 2017, @06:16AM (#2818)
13 Comments
Rehash
Venezuela continues to find sharks to jump with this latest bit of news.

Venezuelan President Nicolas Maduro looked to the world of digital currency to circumvent U.S.-led financial sanctions, announcing on Sunday the launch of the “petro” backed by oil reserves to shore up a collapsed economy.

The leftist leader offered few specifics about the currency launch or how the struggling OPEC member would pull off such a feat, but he declared to cheers that “the 21st century has arrived!”

“Venezuela will create a cryptocurrency,” backed by oil, gas, gold and diamond reserves, Maduro said in his regular Sunday televised broadcast, a five-hour showcase of Christmas songs and dancing.

Didn't Venezuela use to have a fiat currency backed by oil? Why would anyone believe that this new fiat currency, despite having cryptocurrency cooties, will fare any better? I bet that the public nature of the blockchain is a significant part of the reason they went with this scheme. You know, to prevent the people of Venezuela from buying the things they need on the black market. Like anyone would use the "petro" for that anyway.

And we may also have a sign that the cryptocurrency bubble is reaching an unsustainable high, if deadbeat dictators are grasping at this particular straw.

Auschwitz Notes from Hell

Posted by turgid on Saturday December 02 2017, @09:36AM (#2816)
9 Comments
Topics

The BBC has a story about the discovery of a diary left by an Auschwitz victim.

Marcel Nadjari was a Greek Jew interned at Auschwitz and forced by the Nazis to escort fellow Jews to the gas chambers, to burn the bodies, to collect gold fillings and women's hair, and to dispose of their ashes in a nearby river.

Realising that their own murder was only a matter of time, the Jewish slaves of the Sonderkommando documented their experiences secretly. Nadjari put his manuscripts in a thermos flask in a leather pouch and buried it near Crematorium III.

The flask was discovered by a Polish forestry student 36 years after it was buried.

Nadjari survived Auschwitz, but his parents and sister were murdered in another camp.