Journal of cafebabe (894)

Tuesday December 19, 17
09:17 PM
Hardware

(This is the 54th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

It is a terrible situation when a person's best vaporware falls behind a shipping product. In my case, my vaporware fell behind the Xtensa processor architecture and I've spent at least a week rectifying this situation. I hope this isn't entirely idle work but there is definite satisfaction when devising something which might squeak ahead of a competitive product.

However, my objectives for processor design may not be obvious so I list them here:-

  • Design a practical processor architecture which can be implemented. My previous design (with three source datapaths and two destination datapaths) does not scale to the extent of other architectures.

  • Design a reliable processor architecture. I'm not following the classic telephone company philosophy of ensuring that every component exceeds 99.999999% reliability (with the intention that a path through a system exceeds 99.9999% reliability). However, if there is an obvious or non-obvious method of greatly increasing security or reliability then it should be taken. An example is watchdog timer functionality. An embedded system with a watchdog timer may be significantly more reliable than a system without a watchdog timer.

  • Design a scalable processor architecture. As an example, the Alpha AXP processor architecture was intended to provide 1000 times the performance of a DEC VAX. 10 times performance was expected via scale of integration and associated increase in processor frequency. 10 times performance was expected via clean implementation to aid super-scalar design and suchlike. 10 times performance was expected via multiple processors sharing the same banks of memory. In another example, compiled code for a 16 bit stack-based Transputer was upwardly compatible with a 32 bit Transputer. Indeed, the Transputer processor architecture was very much like a Burroughs B6000 transported into the era of the 80286. Like its spiritual mini-computer predecessor, Transputer implementations are very interchangeable because code utilizes very few architectural quirks.

    It would be desirable to implement a processor architecture which works as a simple 16 bit embedded processor with minimal energy consumption and also works as a 64 bit (or more) architecture in clusters of 4096 nodes (or more). Clustering requires very good MTBF otherwise a system will crash or produce junk results very frequently. Likewise, if there are a significant number of nodes scattered like dust then reliability may also be problematic. In the worst case, sensors behind enemy lines or poured into concrete may be very difficult to replace. Even if it is easy to perform maintenance, is it cost effective to have a US$60000 per year technician who reboots lightbulbs?

  • Design a secure processor architecture. Popek and Goldberg virtualization requirements are not negotiable. AMD agrees but Intel thinks that your security is a matter of market segmentation. Overall, techniques are strongly encouraged if they reduce or eliminate buffer overflow, heap overflow, stack execution, gadget execution or allow malicious code transformation into 7 bit ASCII, UTF-8 or UCS2. That may require multiple stacks, tagged memory and/or encrypted instruction streams. I'm particularly reluctant about abuse of the latter but Bruce Schneier (in the book Applied Cryptography?) states that it is better to have 10% of the resources of a trusted computer rather than 100% of the resources of an untrusted computer. Unfortunately, "It is difficult to get a man to understand something, when his salary depends on his not understanding it!" and many people don't understand that trust comes from the user and is for the benefit of the user.

  • Design a clean processor architecture. This encourages design of practical, reliable, scalable, secure processor architecture while also allowing upward compatibility. If there is a "cool" feature then it should be avoided because it may be an architectural dead-end.

    My preference is for a RISC processor architecture with flat memory, fixed length instructions, a 3-address instruction format, general purpose registers which exclude program stack and program counter, a dedicated zero register, no flag register, no shadow registers, separate program and data stacks, register rotation, optional FPU, no SIMD and no vector pipeline. The primary reason to deviate from these choices is to increase code density. This is particularly important to maintain competitiveness. Without competitive code density, instruction storage, caching and throughput are greatly reduced. There is little to be gained by saving a few transistors inside of an ALU if millions of additional transistors are required outside of an ALU. Other reasons to deviate from a clean processor architecture include leveraging existing design elements, greatly increased throughput, compatibility with legacy code and/or maintaining the expectations of a general purpose computer.

  • Design for real applications not benchmarks. If a design works well with a mixed workload then it should perform moderately or better with a parallel workload. However, if a design works well with a parallel workload then it may not perform well with a mixed workload. As an example, GPUs work well with parallel tasks, such as graphics, but are otherwise slower than general purpose computers. In general, it should be preferable to add more processor cores rather than add more ALUs, more FPUs or more vector units.

    A scale-out version of the previous design provided one hyper-thread and eight or more ALUs. However, utilization of ALUs is grossly inefficient. Across multiple architectures, approximately 1/2 of instructions executed are move operations and 1/6 of instructions are branch instructions. Therefore, in the general case, 1/3 of instructions are ALU or other instructions. A scale-out version of the next design provides eight hyper-threads, one ALU and supplemental units. If there is one active thread then performance is almost equal to the previous design. However, if there are eight or more active threads then the ALU approaches 100% utilization. Per transistor, this is more than eight times more efficient. In general, it is preferable to be bounded by computation rather than memory bandwidth because computational units may be more easily duplicated when memory bandwidth is not exhausted. (This also applies on a larger scale. Specifically, it is easier to scale application servers rather than database servers because application servers need not share any state.)

  • Implement old ideas. This is especially true if patents have expired.

  • Use the best elements from other architectures. When appropriate, I have mentioned processor architectures such as RCA1802, TMS9900, 6502, Z80, MC68000, x86, AVR, ARM, MIPS and Xtensa. To a lesser extent, I have also been influenced by IBM System 360, Data General Nova, Burroughs B6000, Transputer, XMos, PDP-11, VAX, Alpha AXP, PIC, Xerox Alto, TMS32020, MC88000, SPARC, PowerPC, Intel 432, Intel 860, Intel Itanium, MIL-STD-1750, FirePath, MMIX, NandToTetris, RISC-V and a mad balanced ternary design by the Russians.

    It is possible to find similarities between processor architectures. For example, there are similarities between instruction fields on Z80 and MC68000. Or similarities between the three 16 bit index registers on Z80 and AVR. Or the pattern of one data register, one address register on NandToTetris, two data registers, two address registers on a Data General Nova and eight data registers, eight address registers on MC68000. Or the hybrid of MC68000, x86 and RISC available in XMos.

    It is also possible to synthesize ideas. For example, a 3-address processor architecture with eight or more stacks may be feasible. Or consider an SIMD version of MC68000 where ADD.B performs four 8 bit additions rather than modifying the bottom 8 bits of a 32 bit register.

  • Don't guess. Use empirical research. I've comprehensively shown that it is practical to restrict subroutine calls, loops and forward branches to instruction cache-line boundaries. Indeed, even on architectures which do not enforce this restriction, it is beneficial to do so. Specifically, MySQL Server natively compiled with gcc on a Raspberry Pi with 64 byte boundaries leads to stored procedure execution times being reduced by 45%. An architecture which pre-scales addresses would obtain better performance (and possibly better security).

    It is tedious but possible to simulate queues, memory access patterns and other resource contention. Indeed, it is possible to simulate a hyper-threaded processor running real applications, such as a word processor or web browser. From this, it is possible to collect statistics, such as instruction length, opcode frequency, conditional branch frequency, register usage, addressing mode frequency, cache hits, ALU contention, FPU contention and memory contention. It would also be possible to collect more detailed statistics, such as determining the most frequently executed instruction before or after usage of a particular opcode, register or addressing mode. It is also possible to collect cycle accurate timings through particular sequences of code.

  • Avoid side-effects and conflation. This can be difficult because established practices become ingrained. A good example would be a flag register. This is state which lingers between instructions. Furthermore, modifications to flags occur as a side-effect of instruction execution. This may create difficulties for super-scalar implementation. It may also increase the duration of context switching. A further difficulty with flags can be found by comparing MC68000 against MC68010. The minor step revision correctly enforces privileges on half of the 16 bit flag register. This allows virtualization to work correctly.

    A practice discontinued after ARMv1 or so was the conflation of program counter and user flags. Furthermore, RISC OS continued the practice (from Acorn's 8 bit computers) where system calls returned information in the carry flag and therefore it was possible to make a system call and then conditionally branch. In the kernel, this required setting or clearing the bottom bit of element zero of the stack. Although this conflation eases context switching, it greatly reduces the address-space for executable code. This was amortized through the observation that aligned, 4 byte instructions don't require the program counter to hold anything useful in the bottom two bits. Unfortunately, this observation is incompatible with 2 byte Thumb instructions and 1 byte Jazelle instructions.

  • Avoid fragmentation. ARM, Java and Android are particularly bad in this regard - and there is a strong correlation between these architectures.

There may be further conflicting constraints.

Tuesday December 12, 17
08:42 PM
Code

For Christmas 2016, I released a rotating torus and fractal animation. Admittedly, this was test data for a video codec but it was also a fun and frivolous piece of software. For Christmas 2017, I've automated the algorithm for converting 1, 2, 3 or 6 bitmaps into a bitmap which can be folded into an origami cube with the pictures on all sides. This has a variety of uses, including corporate branding, like an unusual business card. It is also great for making unique Christmas decorations. Indeed, I've now got some really strange Christmas decorations including cubes of:-

  • A six sided dice.
  • James Bond actors.
  • Clover, Sam and Alex from Totally Spies.
  • Pinkie Pie from My Little Pony: Friendship Is Magic.
  • Bob Dobbs, the Cheshire Cat from Disney's version of Alice In Wonderland and a hexagonal toroid from Mathematica.
  • The Ford Escort RS2000 from Fast And Furious 6.
  • Characters from Wingi's BunniSpace.
  • Goatse.
  • Tubgirl.

The program uses the current version of the generic Makefile for C programs using libgd. This works with a local copy of libgd-2.2.2 and libgd-2.2.5. Like previous programs, it also requires a local copy of libpng and zlib. On Linux, via bash, invocation is as follows:-

export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH
./origami-cube 1.png 2.png 3.png 4.png 5.png 6.png output.png

If you don't have six images, it also works as follows:-

./origami-cube 1.png 1.png 1.png 1.png 1.png 1.png output.png
./origami-cube 1.png 1.png 1.png 2.png 2.png 2.png output.png
./origami-cube 1.png 2.png 3.png 3.png 2.png 1.png output.png

Further documentation is inside the archive:-

begin 644 origami-cube-dist-20171211-204606.tar.gz
M'XL("([N+EH"`V]R:6=A;6DM8W5B92UD:7-T+3(P,3<Q,C$Q+3(P-#8P-BYT
M87(`[1MI<]NV,E_+7X$F<2TYHL1#E&RGRHSC'/74.<9QYM6O3A.*A&2.*5+E
M(<MYROOM;Q<`Q4.D#CO-)/.,3$T<NXO=Q6*Q"ZA^X`S-D2-;<9_*MA-&LJ:H
M75535:BT.TJG=?3ZW>G!\?&]FQ<%2J?3QJ_:-93LEQ55[=Y3=4735<W0$4[5
MNYW./:+<^P8E#B,S(.3>_VF)+IR0A/X@NC(#2L(+/W9MTJ<DB#TR"/P1B2XH
ML>(@H%Y$;">@5N0'UR0PH3^`0=,CC@<J=%UJ-R7IWEWYL8J_<O^_,B_IP''I
M+?=_NVK_JXK&]K^FZH;1-A!.;;>5]MW^_Q;E02L.@U;?\5K4FY`1K+4D/2!#
MZM'`L4BR]F3@!V3DVS%4`_IW[`2.-R2NTQ_:`%T[K(/1=.0N.05G<>A[H1]$
M3CQJ2@]@%(?0G@@J.N(=8&"*3D)S0N=M!,"V].K@]^<]QLBSHW>GO9/G!\]>
M/2?B%$HYVFE:THLW;WIT.H;9R-O?7WX\?//ZQ='+'I-F8+HA);_\0L3XX8OC
M@Y?O>MORT</:VW\]J[?ZL>/:+<>SW-BF1#[.=8-D1'ZCDX<UCE??SI`Z?I;0
M6D1Z6!.C!8R/QT=/3PY.SCZ^/3C]K;==Q-M'O!Q,?5N2P*GND\#W(TERO(@.
M`R>Z_N@H^])/`\>SB1Q=CV%EB!PZGREYI!!Y#,L2*61&IF8P#(FL@$,WPWA$
MG@!CKP]>/:__5_K),J-Y"T!QK8A\2;042/HI&*4(Z=PPL1`(AWKS?OF3;4;`
MP=;9UFC+EK=^VWJU]>Y3,YI&J(.'-5S1.LF*`+(%UD='!8HURR;-)@*.+N%\
M(:";9T<G=?EA[=W[%TK]8>WT"-C8!J:L,4I(/KD@F4T0[)U2)]H3TK+II.7%
MKON);#>;K7("V9EPQ\NR%X_0R&7_"HR=R!:1!Q6S-P%ANXJSUDZ=S&;DO#B!
M=4MJS(+94LC!8*E@P\\.J&:/\5\%QB?EE/L`KVV",/V\"704Q%0LL)*:#"#U
M[G\:7]EH<]0FV^&TN=.:3K?3-C]__E3DO0_LS\Z?BOHA:>J\AL/:O-.8USJ\
MEJ=W6U*?[F<V,<H)$I29^OVLG7/+9@I(I4==]3+$T'9[VSO;"XBP-5`+!4RF
MF`5TM@5.ZZ5$+)>:'E"1A?TP5\.;G\'AI`/8DA,=R7M-+CP?AZ&Q-TQA>7L)
M]-#.`0_M*ECT:DT_K4L2G4:!:47<9CYS)W5P<IC?X+"VN+6FX/0*_<SRAC?$
MZ]\0;WI#/+ZU$Y'3M08ZO;E[>_7F&;C9.JN\/7E>YYK$C=8<?EXROG00D/N?
MM6J`_BKLZ9+Q*2`O:,%QB:Q^RIIHNM*2A-:'AX!0`"?<P]Z,M?-9>G()$3@6
M:@]K$`?4<1#\KU!?N547&&JV+-\;.,,8,A\9CA8(+*:][-%\*^H)IVPG?@U*
M7X6;B(;15R$DLCZ@Y7H0@JQ/@[D?B;N2?>&,"NO/!V]G`56^:E,;@-XK)[J0
MF6`;V,B:\Z]E)9O1^DH<+;.4#4E5V,HZ5,01)+'39']^(BT:S-"^M;V4'5>W
M,9?2`6!_7?M9@Y]US6<-4DFN<O3T[6O(I3;.ER"%R:+F$R`^`I_*M"F%R`4T
M7T<]*VQY`TK5IKR*"`^/)(QV]GF#1T%6GB](0RI4;@EX(OLB?JHOH"ZJED,2
MV851%R(S%V)Y%[VK[`('@E0=^(H]D6QF:4;.B(+1H_Z:X<4\T4D!EHT)9#9!
M&69V0(1$=Q=RW]W]'[^`N<T<*^[_U:ZNB/M__*?A_7]',>[N_[[)_=_\KD]<
MM*UST[?./9^$A/*O"Y%/KOS@DI@ANPSR;&KORY+D$_+R]7MV]YAY@("S-<0W
M!].SB<6>%_CMI(`&9[<4UK(0TJ8#,W8C`*60CJ0(?3.\P'%\W!CXKNM?H926
M/QH!B7"1,K!)"`$$-Q05D%'46`X(LK+D4_2-KVQ1"VE2PXL#G#/TX\"BF*A?
M.!,:,H5CM-`0P0V3@I\/%1+Z`3[#-"5I[)I`B46F?S3/FO].DD/'6WRR:4I0
M`0&9VLF$!J'C>P26AZ&K3:VY*]`3NB(\NSEE04!M=II:9Y$X')BWH@WX6A/_
M"<IIESXG&9+8<YU+ZEZC]0&E,=X=A[%ET3`<0(X*,SA`F]H-T.WU?`9?W%_P
MJ1+\`9[E>-Z!`0]@X8#3,!ZS"`?@!JYY">2D="*T[@?B4AUZZ91:<93IQ5<V
MO%]E(849"=$\?TX6W]3F^RB@9NA[P"=)8]#0"IQQE&"-`W_BV!L]Q*WV_SSH
MN+7_KWS_T;1V-WG_Z715A;W_J(IVY_^_16GMD/>1XSK1-3GUR2'86$3)4R<:
MF6/R`BSO#;</<@CV079:$L#S0V#AL0='<3A]WL%S0.#DWG@8Y(,DKOTUO`Y;
M`!LU+YX4>O&!(\QW#VULXYL$[*!I#;^>TF`?M4[^`[[6&=0\Y4G2(K!MHCCP
MH*_^&-I?I+1'A9XOC&?8]M>PST<.[#@(@B.^%_GY!SW@F?84.$R&`:70L`(*
M#@F<`,#TN:HN/?_*0Q<!6@K_CN&T:PK)1^8U[DL[!J=W`9[;Z;.3-81CB7D1
M00"V-&0*7D@@2?7",1#PK&NNTXGO"*9H;6@?C<PAW2'.2&ED&BI30:`(%4#=
M51JN^E@T+"6IA5.LA=.>0'[W1PVPF6I"R$^MBUI"A,!1&E*B[(L6P9.JYBH]
MY;&K_`IT7.71H_I\4`RK.*RR834_#/249-:7-'KK3*F+<S<8IXR#I"2\I5`*
MAVI82@;PBY3_]L%Z+_GPEU0`]3L1()S*JNPRZ!M(H7U74J32W$`4_7NQ*'4N
MS692B*@R%:,`D?J8Q+^,:#"D!/R"#;L;'0=&#W$`&Q>8"LG5A6-=$`R1LWX$
M@]/0A!0:WW<3=R)"P@BC!T8"(,8NT,#($'QK``2%>R']&/[S8\\VP;O1(/`#
MF,J!0!A<&&":Z$&=43Q"=DQR8;H#,D;M<-B,\V'L+_$]F8:6;>C91OLVOFF)
MH2PU$C@,7.6)JV;-!OI@W1>ZU[:F+X3B;PM68FJ+F%(%A<U8:M^8);V:)6EM
MUUMBXOPX=CQV'IO!T&I8%Q!5[>Q`?<(E$H3)#E!\G&^KA;96:.N%=KO0-@KM
M3J'=+;1W"^V](C\9BVR$UX\E'E6@7+W>_'0<L%];P,)%-NR7QGU(,<<_$V=[
MA%&]A_DD^_V<Z9&#$<10Q($C_N=S[[[0(ITZ44V=QR2"_L^]W2KZ6^$^B4-@
M<9]L80(]CB.UB?DBJVII54^K[;1JI-4.J_IQ!'6L`E,-7*@_E0_)-\NDEC(Y
M4A.CXI'B"Y#P!20Z-8:F<C18P&506@*E+X/2$ZCV,JAV`F4L@S(2J,XRJ(Z`
M&M1PU_=ZK]\?']=G,]S(V8:>;;2S#2/;Z(C&LM5D;G8?0C_TZ)B_`5?@Q-%!
MLW.AF5F8[(+H8D'@#YP([+=`+H!!7@J),T2FF,$R[\V<*8;)>8?:*'9IBUWZ
M8E=[L<NH-[+-3IT5YL6O<Q.?+4Y\MCCQV>+$9XL3G^4G/BM,S"5F.U<LNY)?
M]M,@IH>^"\=&.-W1&OA'`';7!=Q=%W!O74!U&9.["+E;?RQ6/!Z'E@E9!+,4
MW-"A6.V$@C^^/J$\,+"9!\=C3!'_YC,W"E:17ZMZQBD^`],*_.OY:;S@!ZKD
MRC#$CE+D98&/(F21=6T%ZUJ>=:V"=:W<.:W'NG8SUO45K.MYUO4*UO5RC[D>
MZ_K-6&^O8+V=9[U=P7J[W(VOQWK[9JP;*U@O.`^C@G6C_&Q9CW7C9JQW5K#>
M6?![I:QWR@^\]5CO5++.W0]>;OJA$U'A?(CO0<+"PPEQG2'\$0"/S2BB@4>N
MS!#R)JB.'(_:A([&3N#@7?TUZ5_C';R-$=/8'-.`7X";?>JZV#=Q0G9G,C`M
M.O=T>.[Y`:&83T+V$H>T(9!Q(GY98B/E^;U-B/F4-Y\I/P=#HF/4D9W.P&Y=
M+::YD`#?,"'>O+J.!4*;$5[$>Q1R*LRS`-+'_R7BR@&.V'42@$ZHY^!U-F,5
M)@0V@XP$-@W9]5*B(Z!AAN(](MR76?[%`!_)O"1?></VH_F4,W).-/(7?+'\
MBH.L-DOR@;](IIT=GP%>"_ZFE)#6;(ZI'V<Q2>?I+-<F[?GXA-')4IH0E='*
MEUFALCC>`KQ)AM+7U%,RQWE&+ZT2'LXSK+4R>(N4",)JIS.$(Z72POA,>YH?
MSU+BZS%C@PC;(J)17$O@]GS"Z/Q%,G@E/,W:)X3\@<LW*^5IANN&XR<S4B$=
M<@Q?,=\YJR606H82C`N^^;AV6J6G%A+EL*5Z0KD1^YR0I1IO90SGO-1^".,V
M/_[/VE.%C:]HS_X92K/RO;[*%R1X64J=X\HYU>H=#'@G!4J3G-=8X0OTM'OR
M#^KI*U'Z`?R3L<H_&4O\$Z+/_0_W3W^ED$;.%PC_)(S)>$JJ_%.)_R$K_%>6
M$NKM2>)_N'^:0_*QU'Z$?V+^*\&K\$_&*O]D_&#^:0;;M'4#7W#.MG?^-&_-
M(PQ<N'Q<<%J,"^;C2&N2H]0"FYG=("Z8`-[YU]83)Q7ZKF.3D%KX0EB2=V.:
MR_-MB)EU_F<Q`Q"O>4E^JY<ERTJ#O[?LR(R6NCR1R&'HFR)TJQ!RC.HK&%4%
MM^O/:U0I:.D4W56,ME<P:FS,J+HIH\9:C!IK+;VQZ=(;FR[]F@B=^0SK*B_%
M6*F+3D-;0Q?ZIJ+I%5FL[;"?RA2W\IRE+MN9!3YWV2XH=.XQB^.=_*4LV=<<
M`;][JZVK3*-9<MT&O\O!;WNU[2W=S5UF>XMS&&(.8^4<E8999'E7?/=6LVQ4
M7CBP/+MLA=2R%=+*5JA,W*Y8'<;G&L8G;^Q3Y756U5A+17*E[UO*0;="J;G[
M&O;C-L>E^?.,O8TP@NP1HON!(R_<-BG+[HHK[F$K[C@K[@\K[N8J[KT6NKOE
MW;OEW7L5XG`QQ8^'%/[;H;L?D-^5NW)7[LI=N2MWY:[<E1^J_`\]G.O'`%``
!````
`
end

Debug test data which should be saved as test1.png and suchlike:-

begin 644 test0-0-0.xcf.gz
M'XL(""JI(UH"`W1E<W0P+3`M,"YX8V8`[=UM2%-1`,;QLVEJIK7,1$+BHE$S
MG#@SB9(0&Q$2%:$1$=G<FZN]R%Y$(]*0%A%1$C$B(B)"(B(B1"(B(B)"(B0B
M0B(B)"(BHB0BPNY==^S>;=^>+Z,]%^YV=L[^4[Q^^G&V>;S^?FG0X9;<7I]+
MR$>[>BK'4ODT&.2;"OFL5!:4!T*Y:3<J(^5FF7R6=)P30CG5YQ4J<_)9I8Z5
M8KE\URK?EWOD'VEQ!/U^5R"2?+G$ZYH2*UZ_W>.R>$)>I[IH$.9P9,CGDL)!
MG]=97VIV>QQ!7S`DF1-WEI"GURXU-38ECBP#JSJHE]/>+*DU_8F9`R4=#/?;
M'=Z`1[(F7UF>',HVJ<Y9H@%O1/(&''VNL-('W>ZP*R)IZ\RI?S/Z5*0.P[00
MM1-";!D3HBL@1+13B-@:(>*%FB<E+Z'R1S6V)J^3<A2IUV->O3[):[-0'2M'
MC69<JADOTHS+-.-RS7B)^KK)HUBSIOV_4`_C:-IO+*\;X_HYX[CR8-@DQ(A)
MS/_Y_>OGW/=O7[]\_O1Q]L/[=V]GWKQ^]7+ZQ?.I9T^?/'[T\,']>Y,3=^_<
MOG7SQOCU:U>O7+YT,7[A_-C9,Z=/G3PQ>AS,1\#\&)@?!?,C8#X(Y@-@'@'S
M$)CW@WD`S'U@?@C,^\#<#>9.,.\%\X-@?@#,]X/Y/C#?"^9[P+P+S'>#^2XP
MWP'FV\&\$\RW@?E6,+>!>0>8MX/Y9C!O`_.-8+X!S%O!O`7,F\&\"<P;P;P!
MS->"N1G,5X/Y*C"O!7,)S%>">0V8KP#S:C"O`O-*,*\`<Q.8+P;S,C`O!?,2
M,"\"\T(2``F`!$`"(`&0`$@`)``2``F`!$`"(`&0`$@`>4$`PZ;$[@7='H?U
MN;*WH3BU7->3N;>A+JR?JTOL?R!J$#6(&D0-H@91@ZA!U"!J$#6(&D0-H@91
M@ZB1+ZA!!B`#D`'(`&0`,@`9@`Q`!B`#D`'(`&0`,@`9('_W-K3DRMZ&HM2R
M;5/FW@;;3OV<K2?YN0W$#>(&<8.X0=P@;A`WB!O$#>(&<8.X0=P@;A`W\AHW
MUN4*;BQ(+7=79^)&=X-^KKN-;]P@:A`UB!I$#:(&48.H0=0@:A`UB!I$#:(&
M48,?2$D"(`&0`$@`)``2``F`!$`"(`&0`$@`)``2``G@?_OLAFQ[&YIS96^#
MYJM!HS\R]S8,E.CG!JKYA9O$#>(&<8.X0=P@;A`WB!O$#>(&<8.X0=P@;A`W
MA#!:<P4W"E++L>E,W(C-ZN=B<WSC!F&#L$'8(&P0-@@;A`W"!F&#L$'8(&P0
M-@@;^08;Z;B1H(D.N^.P)Q2,!IRYHAS&U')\,DTY"N2Y*?U<?$;=PC&?//\"
(H`"7G.>7````
`
end

Example data for a dice:-

begin 644 dice1-0-0.xcf.gz
M'XL("/JV+EH"`V1I8V4Q+3`M,"YX8V8`[9L+<%Q5&<>_I)M'LVW2-&G:;9)F
MP^;=IDEWT\=ND\M#8$3!`15&$!Q*'R'2%WTXK8(MCJ#.@.`HB`\4'^`+$!^@
M*)2B@D4$101!`04%!A$0%12Q/?Z^N_<D)YL-;;-=9JGNS"][N_?N/><[W^.<
M_Y[;P:$UZZ-;EJ^*KAI:O5)$BM^11HJ@6M_U8#KT<Z*WF`\NX8_LTC\=Q?I9
MD7ZF5W7HGUWZSW/T[,-Z\3GZV</ZIP;*WW"9B,*K%D)0"77!L5XW@[<&WJ<-
MTK7NH37+!E=V#VX86B'I\T72L7'3UM4KHQO7K1Y:T5G1L6IP^;K5ZS9$._RW
M[@V#9RR+]L[O]5]9#A8$!YU\]8PL7UV0>>'8`_WJEHWKERT?6CL876#OS(=;
MLWT8?-:]>>W0INC0VN5GKMRHWU^W:M7&E9NB[K?'?I3^9/179>15]!M\=+A(
M]!J1WFDBJ:M'S@T[<A*4K%ZY:I/_:7"V-!APHTYQ!G]R<*RO!N>XPCD..\=3
MG..ISG%5<%_[*G/.6<=/<CI[84:G-;"NQS@BL+H1.F'AZ&M"Q\))\#WLBT('
M],(2.`S>",<#UY:>"D10Z0ZZLAZVP';X,%P"E\.5='D-7,<P'`TW`M=/O@/N
M@0?@47@2GF-("-J*`3@*CH,3X30@C2JX3\5F.!?.AXO@4K@"KH)KX0:X!6Z'
MN^%^>`2>@&?A1=C-<)<`8QLF,<*,1;@5NH'Q"--^F/;#S^,*$FS*K?!3^`40
M'%-^#T^!GO\G+GJ%?&/<J_!'%3ZJF@E-T`X]L!@.A6?(/\Y/NPEN@SOA7G@(
M'H.GX05X&;\0-]4:-[*-$-Q614@9_ZBB1Z:9#JDS,:DW49EC&J79S)86,U/:
MS0SI-#4RUU1+MZF2'E,IO6:*Q$U8$F:R+#3ELLB4RA)3(DD3DI29)$M-L?2;
M/=Q<=@0-B.PQ,B"E9HF$34(J3:]4FVZI-5TRT[1+Q+32<DP:3;,TF2:)FCER
MB&F0F*F75A.1-C-+.DP=/9DA7::6WM3(/#-=YIMI]*B*'E6F;=H?2@,X'I`B
MTR_%/DMEDDE)R"2EA-Z6FL52YK-(RLU"F>S3)Q58$?:)RQ2S0*;Z]/J659H>
MJ?*9S[@JW;ZU:>;)]&'F2LTP7?YHI.F4&6-0_V2CW1_!W`C&Q']?8*;BW2D^
M";S<9RKP\F2\7"Z+39G/$CR>]#U>@L=#>'R23S^>'S!%/IY&P(`?!A4FW4+)
M]G08^'^*S`C%PZ/?S^@O'2;D>R)-B>^1T93Z'AI+V;#7)LY$XDC'<CP_9?.I
MZW,W%MP8<6/'QI.-+QMO-OXT%FU<:HS:>-78M9;I".GHZ:CJ*-N15T]XZIIM
MDVY-&S9E>]J%FLWJ7\UN];EFNV:]9K]6`8T5K0I:';1*:+7P;U!VZ'`,]`=1
MH1%BHT4CQT:11I2-+HTTC3B-/(U`&XT:F9KIFO&:^5H!M!)H1=#*H!5"*X56
M#*T<6D&TDFA%T<JB%4;[I>,P7N1DBS(W"MWH=*/6C68;X]N8E"8211:MO%J!
MM1)K1=;*K!5:HTFC8%ME^CJMGIK'&D<:)QH3&@-JCW:CTLNI.!X0"NE^)7MP
MD[I0W6M+NRWIMHS;LFW+M*:H+;=M,HLACY@6F>VC3CE$&@BQ1M])37[(-1%Z
M4=]Q2@/A6.^'9<QW:,0/TU;"M<U'':W4^6'<D4Z@T$QZT,Y=&SA32VI5DA#E
M)$PQ<^KVH*CN](-.`U*#58.['(LJL$B#0`N#!H06#@T.+2I:9#18M#!IX,S"
M(@VBV8%%FB@-6*1),P>+F@*+-)F:`XOL]*RT8%%K8%%;8%%[8%%'8%$N'IOA
MIW>:6C_=T]3XZ9]FNE\.TF@1RF2:7S+&4N67D]P9[_[9^N+VU;7!M<VUV9V>
M.YTS7<XWYCIWFN>TT)VE!_/'Z6W/`1J-\>Z?K2]N7UT;7-M<FW.)(YM;-M=L
M[ME<U+RT.:KY:G-7\]CFM.:WYKGFN^:]K0%:#[0NV"6`G=+M%*[UQ$[5=GJV
MT[+6(:U'.GUH/J?7S-O)\B39GB#KNQF%=OK4R/VKC=8%O<;FELTUFWLV%S4O
M;8YJOMK<M9.BHOFM>:[YKGEO:X#6`ZT+6A^T3FB]T+JA]4/KB-83K2M:7[3.
M:+W1NJ/U1^N0UB.U9L^(QPX4Q8?YH(/GI?%?EP7HZ\T!^FH-R%3:96N&5JSP
M?TDI"(T=&CE=_=>Q&GLZUT7/`71P]!-PY>AK:M#@-=^'G;`+7N+6";@8/@F?
MAZ_`-P$=7HO^K$5_SK@+[B,UT*!U:-`Z-&@=&K0.#5K'>,RDKS/I_TR.9S$&
M$?1[!/T>0;]'T.\1_!!!OT?0[Q'Z$3D=Z&/DB_!U^#;\`'X$/X-?P6_A<?@S
M_`W^+3(;&V<SKK.GPVPX!+H@#BDX`MX$;X53X`QX-VR`]XK4,[[UZ.L&_-:`
MWQJJ(0+-T`D+("G2B`YOY+HYQ\`)<#(L@R$X&[;">2)-?+^)MJ+X-*ICC^W1
M%B#<HGW0#T?"L?!V.!56P&K8-*RQQV/JCD!Q9BI-=RFBJ_ET*8B-6@9T^:53
M5]Q6@^NZ6=?1KO8NWA-\>:>_0-6<U/RT:P(5$)K'F6L!NP[(G/]=2:YU)9>5
M6:94SEQKV?66*WVM5+5KK\SUEUV#V768)>I;,&?4NLRNS2R-PY6Q>=1Z+1OU
MPQ5UXKCS>>::*?-G#CN?NW.WG6=5#*DHRI3L5K9;Z6Y9Y,NV\E%2WLIYBQ5D
M+BE?^HW%"KE<R"6.[#R=C6P^=7WNQH(;(V[LV'BR\>7.]>Y\;^=\5]:[4M[.
M_YF2W<IU=RW@RG,KR[<5;4_;7+ESM")7]Z@+]:15U-G(YE37Z6XPN$'B!H\-
M*!M@KBIWE;E=`+H+/'=19Y5ZYH+%+E9<U>XN3NRB)'-!HD5,B]F!D(KV%PVM
MK=7^HJO+7UH-+[PJP_YU=M&C!54+:S%G=A_<8KIL7V>L?!?WM#">R(Q8'LR(
MH?S,B/FNY+FXK]!GQ'Q7<G?&+?09,=^%/)<X*I@9L2R8$<O'F1'W6LE#Z4I>
MZ+-:EIDFM`\S36%)[=(-0X-G;BH4I5TR<KKYE+%*N_G](CU_@+_`OT1Z0Z.O
MB:%.8ZC3&.HTACJ-H4YCJ-,8ZC2&,H^APEOX3@M];*'-%NQI03FV,'HM*,<6
ME&,K0]J*:FY#/;:A'MM0CVVHQS;48QL#U8;2;_N@2#M*NOT2D0[NV_%9^#)<
M`]^%F^$G\'-X$/XCTLGX=E9"'<R!-I@/B\"#H^$M<!*\"U;!6G@/8'?G!?!1
MP+N=GX.KX3J1+I1T%RI[+@IZ+MZ>^S9X)RR'LV`CO$]DWC;X$%PLTOUQ^`Q\
M";X!WX$?PH]%YO?"I8SO\7`ZG`GK80ML!^SNP>Z>R^%*^"I<#XQMST[8!;^$
M!_>JM,<C?"`V`=("^U5^I`\%/]*'#\R/]+FL_/*]85'H/]*[2X!\;W84^H_T
MN<11OC<["OU'>C>.\KW94>@_\.<21_XRKBK8[)!7V>RH3&]VZ-5[7E]:>K+1
M_??^/#]<E-:T^7J$JS1XA"OT&CS"E8='CR1X]*C*??1H(LXL]$>X\OWHD5OV
M['NA/\*5[P=])A)'K\TC7).#1[A*<WB$:Z^/'X4"ZXH*[S&OO3T2%4H_$C7\
MI%;5!)[4*BR-'5JW8D6A*.S2D=.]-X]5V+VHZQ2*,_4!0&VF/CWZFCCJ-X[Z
MC5\(*-\XRC>.\HVC?.,HWSCWC*-\XRC?^*_A8?@3H-CC_P"4<()^)+`A09\2
MV)M`@2<8VP0*/($"3QP)J.\$ZCN!^DXP>`G4=P+UG:#]!.TG'A?I:P+4:Q_J
MM0_UVH=Z[;L+[H/?P1]%%CX%SXLL^CN\(K(8?RQF+!?70#W$8"X\([*$/B11
MYTG4>1)UGD2=)U'G2=1Y$G6>1)TG4>=)U'D2=9Y$G2<9JR3J/(DZ3S)>2>(B
MB3I/HLZ3J//DC;`#[H![X`%X%)Z$Y^`EV,,XXY=4)=!^BO93M)^B_13MIV@_
M1?LIVD_1?HKV4[2?6CMAA7VP[H14Y&LG)+W/,I&=C9)@9Z.L,'8V<AG>@VVO
M?W]W-MSUU<&VU[^_.QNYQ-'K=J\_%.QLA">ZLU&4WMDXV/;WL^R$%.W^WWR`
M_?6Q*M7_P1@^8MGRLP8WK-N\MF`6I\4CIY>>G-%E^K+T7!'O:W`#W`9WC[YF
M(`*-T`R8/-`)W;``%D(2!N!P.`J.@>/@!#@1:'/@-%@&*V$(UL#9L!FV`GT8
M.`_.AX_`1?`QN!0^!5?`%^`JH*\#U\*W@#X/W`2W`'T?N!WN!&P8N!?NAX?@
M$7@,GH"GX5EX`5Z$EV$W=C,67@G@&X\Q]JIA!F"_A_T>]GO8[V&_A_T>]GO8
M[V&_A_T>]GO8[V&_A_T>]GO8[V&_A_T>]GO8[V&_A_T>]GO8[V&_IS[`?@_[
M/>SWL-_#?@_[/>SWL-_#?N^J8'%J]DKI/ESS_^M>_;I]H?@PL[\EX[_-=(5D
$V3X`````
`
end

Bug fixes and other examples may follow. Actually, bug fixes are most likely because there are multiple techniques for folding an origami cube and I don't think that I'm using the most common technique.

(Usual instructions for uudecode process.)

Wednesday November 22, 17
03:31 AM
Code

In response to SID22548, CID596405, regarding quick duplication of files onto external USB storage, I wrote a utility for this purpose. Buy some 10 port USB hubs or suchlike. Buy required storage. Run utility and mount first batch of USB storage units. Press return key. When prompted, unmount storage. Press return key. Repeat as required.

The utility makes several assumptions including where storage appears in a filing system, available storage on external volumes and handling of special filehandles (if any). For efficient use, it also assumes that disk buffers are sufficiently large for parallel copying to multiple volumes. (Copying may be exceptionally slow if available buffers are exceeded.) It also assumes that sync is a reliable checkpoint operation and this is known to be a fraught assumption.

begin 644 duplicate.pl.gz
M'XL(`'_L%%H"`ZU4WV_;-A!^UU]Q<;1(6AI3\I*UBZ-T6S,,`88^)-WV$&<`
M+=$Q$8G42*J.T;I_^X[4S[@;5F#SB\6[[[X['K^[PP-2:T667)"*J0).?O>\
M0[@U4M$'!E=U5?",&BX%W&:*5P:=7!@F<I;#2BK(N3:*+VL'D2M8\8)IP.]?
M;W\$W=)0#3G3&+_$*"Y@;4RESPG1<ELP803;Z*E4#R2398EG/:V*UYKGZ6QV
M=OKJ*,.OL^^^/8W/#EO`1M$*"PG?1+,X>0GOU@S>2*&E,KPNI_8"UIXDR2LL
M@2KC>;I>0D45+>&#!U`IO`/<OKOZZ>8&)GY\#K7&.L_!CX%4U*R)D5A<K3)&
M<JY8AO?8PEWOHK61I:R%&;SW"S&9(S5[XB:<17-OUR3-UBQ[=$E]Q*9ZS5?&
MXO@J/#AASA@Y=U?5Q":!P#H"R"4V4TAC:;69MCG:+-]$]K#KR/)_(^,-55_R
M'MWI<SKUA72*T9PN"[;'=M:S*69J)?J&+&GV:'C7DZS,AY[X"$V#P'[*BHGP
M^FW\8F(A\'$2=8!I>H'V2WO,"JF91?7.]),F=PNQ4/?'/B%2/[/#PC0N`JVK
M*2VTB.;%\.8ZHP55X?<_W/S\6W21-`TX<MH)'<C7*DM]Z[Z+[^<VQ%D^D3\6
M)^1S^#[EY>P+,&D+\G-MVES)?2L;9QMG&W/9EN^`%9HU\55!TZ.NY6%0"UHR
M.-%!U)$A`-B?$/S"1?T4=(0^[H1QW&8M:<F;J+:J"2E9SBFQT/;A>TUKTQ&U
MX*`!!PW.JF)49*\NII14YX`E&5PM)=3B4<B-F**%403KBF5\M855713@I&@D
M4`']./ZCMN.GY&77'>_(S:1[-;1U)UOTW/,V:UQ@8?OL7]G:F[9:02)[>'5]
M$[]HP0`-.F<K+E@>^JQ@<6H'HD-&4=]1#/G@`+LTZ4;#";C'SOO--&EOP[3A
MHMF^[V51E[@,S)H)1#&M6_5V-[W`=7;]ULV%OY(R3N/_HVS[I#UNN$(T>E],
MEC3)[&^4,'$)IP$)IBXJZC"?IT_&Z9-H1-^HU$(.K.:G.+EC9YO_^'C>FW;>
M\_]QDY.^!LMJ(R_C,5W;_J;9=M%I0USM0;.%R\IL47&Y?0;%4*0HRD=>55P\
M#(K;$_>(-Y/5%J&P4K)$=A0@^3JP(AYG&A,!Z*TVK`PG604G%>YD%P,#?!(-
M6/?N0RMVP[!Y/5&@MR(;QM\I)8WWUCS>=6WKQ'HYR[N"_FYD&X2]@J/J9-J'
3#)*NQ7\1->[FOP#Q>W/<J@@`````
`
end

(Usual instructions for uudecode process.)

Tuesday October 31, 17
04:01 PM
Software

(This is the 53rd of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

I've been working on a video encoder and decoder which divides frames of video into square tiles and then sub-divides those tiles and applies a specified short-list of matching and compression algorithms to tiles of 256×256 pixels or significantly smaller. Want real-time conferencing? Stick to MJPEG tile-types. Want lossless video? Avoid affine tile-types. Want maximum resilience? Minimize use of key-frames. Want best compression? Use everything.

In addition to viewing real-time or pre-recorded video, this arrangement can be used to display contemporary or legacy software which is running elsewhere. Indeed, it is likely to be used extensively for remoting LibreOffice Writer, Calc and Draw.

So, the video codec is likely to be used in scenarios where the majority of a window is plain white and the exceptions have a hard edge. I presumed that the patent-just-expired texture compression algorithms would handle this with ease. However, an ex-housemate suggested inclusion of the simplest historical method. Well, I'm not a complete misanthropic curmudgeon and I do listen to my friends so I implemented a Run Length Encoding tile-type, although I now wish that I hadn't.

Run length encoding schemes differ significantly and, in the case of the Amiga's IFF picture compression, may be defined ambiguously. A typical byte encoding would allow up to 128 repetitions of a following value or up to 128 literal values to follow. However, the cut could be defined anywhere. So, it could define 240 repetitions and 16 literal values. Or the opposite. There's no point defining one repetition and one literal because that's the same thing. Likewise for zero. So, definitions may be shifted by one or two values.

Should Run Length Encoding start afresh on each line? Implementations vary. The next problem is more pernicious. What is a repetition? If there is a sequence of pixels and two pixels are the same, should an encoder algorithm cease encoding literals, encode a repetition of two values and then continue encoding literals? For 24 bit color, yes. For 8 bit monochrome, no. For 16 bit data, it depends. So, a dumb Run Length Encoder has to look ahead by one pixel. In some circumstances, a smart Run Length Encoder may have to look further ahead.

Stuff like this caused me to implement an assertion system inside my buffer structures. Specifically, there is a compilation directive which enables a shadow buffer with a type system. Arguably, it should be enabled by default but with intended usage of 3840×2160 pixel video divided into 64×64 pixel tiles and each of the 2040 tiles requiring multiple 4KB buffers, data-types on buffers would require a large amount of extra memory.

However, I've yet to get to the best part. I remember exactly why I didn't implement a Run Length Encoding tile-type. The RMS [Root Mean Square] error (or smarter variant) for Run Length Encoding is always zero. Therefore, when I include Run Length Encoding, the encoder invariably chooses Run Length Encoding for every part of a video frame. Even if the choice metric is set to error divided by encode length, the result remains zero.

Run Length Encoding greatly improves quality but it also greatly increases encoded size. Attempts to restrict matches have had mixed results. I've tried setting a minimum tile size and a maximum number of tokens per tile. However, it is easier to exclude it from the encoding process. This experience has made me slightly more of a misanthropic curmudgeon and I'm less inclined to take advice from people who know very little about codecs.

03:48 PM
Code

(This is the 52nd of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

I posted example code to perform an 8×8 bit matrix transpose. This is intended for multiple applications including driving multiple serial lines for a network protocol, driving multiple serial lines for I2C, driving multiple serial DACs and bit-plane audio and video compression. For this reason, I want something which is fast, flexible and reliable. I don't want to keep coming back to it. I certainly don't want to break or fork it. Although the example code is supposed to be optimized for eight bit processors, 16 bit processors, 32 bit processors and 64 bit processors, disassembly of 32 bit ARM GCC output found that the results were dismal. I am also disappointed by the amount of effort required to correct this deficiency.

I'm using a failing x86 laptop as a dumb terminal for a Raspberry Pi which is off-line and has a fixed configuration for the duration of the development cycle. Therefore, my first use of objdump -d to disassemble GCC output was applied to 32 bit ARM code on a Raspberry Pi. And the result is horrible. There are hidden functions prior to main(). That is known. But do they have to be trampoline functions? Is this related to ARM Thumb mode? Who knows. The compiled code is also quite laborious with stacks and stack frames. With hindsight, this would be the focus of my optimization. I'm aware of inlining and a large number of other compiler optimization techniques. However, with more streamlined stack conventions, there would be less code and less overhead to amortize when performing many of these optimizations. I suppose this is part of the "Why the blazes did the compiler do that???" response to optimization.

Anyhow, I'm optimizing code size on the assumption that loops are unrolled and every instruction takes one clock cycle to execute. Therefore, smaller code means faster execution and less energy consumption. There are numerous cases where smaller code is contrary to faster execution. There are also numerous cases where instruction count is not linear with execution time. However, on this architecture, for this subroutine, it is possible to equate length with speed as a first approximation.

It helps if a desirable result is known. On 32 bit ARM, there is a four bit conditional execution field in every instruction. This allows ARM implementations to have a very simple execution unit and avoids instruction decode pipeline stalls when jumping over one or two instructions. It is faster and easier to take the hit of one wasted instruction slot. Unfortunately, the consequence of 32 bit instructions with a four bit conditional field is that code density is terrible.

Anyhow, one would assume that 32 bit ARM has a register bit test instruction or a method of performing an AND mask operation with a short, immediate value. Being a three-address machine, it would probably have a method of doing this non-destructively. Although, with a four bit conditional field and a four bit pre-scale field, short, immediate values may be mutually exclusive with a third register reference which would be required for a non-destructive test.

Even if the compiler's input was the dumb technique for an 8×8 bit transpose, the output should be a sequence of 128 instructions or maybe 192 instructions plus some instructions to clear registers, marshal data into two of the eight 32 bit registers plus all of that stack frame nonsense. So, what did the "optimized" code produce? About 262 instructions. At this point, I felt like the comedian, Doug Stanhope, when he looks in the mirror and says "Well, that ain't right."

I (thankfully) don't read ARM assembly but a large number of stack references and square brackets indicated that GCC documentation claims about register allocation are complete bunkum. Again, it helps if desirable operation is known. With 64 bits of input packed into two 32 bit registers, the same for output plus one or two working registers, the majority of stack references would be avoided if five or six general purpose registers were available. However, eight char pointers for input and eight char pointers for output, provided as function parameters, had taken precedence over the more frequent operations.

I really wanted to keep the input pointers because I wanted to use the 8×8 bit transpose as a building block for a 12×8 transpose or suchlike for 12 bit DACs. After a re-write (times four due to premature optimization), output performed post-increment on one pointer. This improved code size but wasn't sufficient. Reluctantly, input is also obtained via post-increment of one pointer. This frees sufficient general purpose registers and the compiled code minus stack overhead is about 80 instructions.

Unfortunately, I was less than halfway through this problem. After getting sensible output on 32 bit ARM, I repeated tests on 16 bit Thumb instructions, as used in an Arduino Due, and 16 bit AVR instructions, as used on the majority of Arduino micro-controllers. After about three days, I had something which compiled reliably on x86, 32 bit ARM instructions, 16 bit ARM instructions and AVR. I have no reason to believe that it performs badly on 6502, Z80, 68000, PowerPC, MIPS or any mainstream architecture. However, I'm astounded that it took so long. I got to the point where the default command was functionally equivalent to:-

objdump -d a.out | wc -l

with the occasional back-check to make sure that I wasn't inadvertently optimizing an unrolled loop. (Yes, many hours of fun there.)

After understanding the dumbness of a compiler in this particular process, I devised a method to implement the partial 8×8 bit transpose functions in a fairly efficient manner. Indeed, this can be reduced to a macro which defines a C function. The function has local byte buffers for input and output. Conceptually, the input buffer is cleared and then data is selectively copied into the local input buffer. A call is made to an 8×8 bit transpose and then data is selectively copied from the local output buffer. The compiler is able to unroll the known quantity of clear and copy operations to virtual slots on the data stack. It is also able to eliminate redundant clear operations. Most impressively, it is able to eliminate all operations involving dummy inputs and outputs. The partial bit transpose functions are proportionately smaller than the complete 8×8 bit transpose. Unfortunately, compilation is relatively slow and this is worsened when it is trivial to define many variants of bit transpose function.
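A minimal sketch of this trick, assuming the hypothetical names transpose8_bytes and DEFINE_PARTIAL_TRANSPOSE (the real macro presumably differs):

```c
#include <stdint.h>

/* Plain 8x8 transpose over byte arrays: bit i of out[j] is bit j of
 * in[i]. The fixed trip counts let the compiler unroll everything. */
static inline void transpose8_bytes(const uint8_t in[8], uint8_t out[8])
{
    for (int j = 0; j < 8; j++) {
        uint8_t b = 0;
        for (int i = 0; i < 8; i++)
            b |= (uint8_t)(((in[i] >> j) & 1) << i);
        out[j] = b;
    }
}

/* Data trampoline: local buffers with compile-time-known contents.
 * Rows beyond `rows` stay zero, so the optimizer can delete every
 * operation which only touches dummy inputs. */
#define DEFINE_PARTIAL_TRANSPOSE(name, rows)                  \
    static void name(const uint8_t *in, uint8_t *out)         \
    {                                                         \
        uint8_t ibuf[8] = {0};                                \
        uint8_t obuf[8];                                      \
        for (int i = 0; i < (rows); i++)                      \
            ibuf[i] = in[i];                                  \
        transpose8_bytes(ibuf, obuf);                         \
        for (int i = 0; i < 8; i++)                           \
            out[i] = obuf[i];                                 \
    }

DEFINE_PARTIAL_TRANSPOSE(transpose3x8, 3)
```

At -O2 or above, the copies through ibuf and obuf are scalarized away, which is the whole point of the trampoline.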

So, it is possible to optimize bit transpose functions by minimizing parameter count and by creating data trampolines which are optimized away by a compiler.

I had hoped that a full 8×8 bit transpose for ARM Thumb mode would compile to 40 instructions with minimal escapes to longer instructions. The result was 80 instructions which are mostly of the extended format. This is disappointing. At a minimum, this consumes twice the intended processing budget and this assumes there is no execution penalty for mis-aligned instructions. However, I'm targeting a device which has twice the requested processing power. So, I've soaked available resources and possibly eaten an extra 40% into the processing budget. It may remain desirable to write assembly implementations for one or more architectures. However, I've found a reliable but tedious method to avert this situation in some cases.

03:40 PM
Software

(This is the 51st of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

I've described an outline of my ideal filing system. My ideal database allows generalized use of the full-text search facility. This requires an outline of proposed volume storage. This borrows widely from ReiserFS, ZFS and MySQL Server storage engines such as InnoDB, Nitro and NDB Cluster.

Media is striped into 512MB fragments and each unit has a map of stripe allocation types where one type is no allocation and another type is bad stripe. It is envisioned that six out of 67 stripes perform parity functions and this is rotated in a procession across units. Each stripe has a bad sector map. For 1KB sectors, this requires 64KB. For 4KB sectors, this requires 16KB. If these sectors are bad, the stripe cannot be used. If the stripe map is bad, the unit cannot be used.

The remainder of sectors within a stripe are available to an application. However, applications may not be expecting raw, disjoint storage and therefore a standard contiguous mapping with redundancy may be utilized. The full-text search for the filing system utilizes a specialized database in which search terms are striped by length with attributes such as capitalization and accents. Conceptually, "House++" would be stored as HOUSE-15-10000-00000 where digits represent punctuation, capitalization and accented characters. Sequential entries would be compressed into fragments occupying 13 or 21 sequential sectors and would be split or reconciled as required.
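The attribute digits could be derived with something like the following sketch, where normalize_term is a hypothetical name and only the capitalization attribute is modelled:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch of the term normalization described above:
 * letters fold to upper case while a parallel digit string records
 * which positions were capitalized, so "House" yields "HOUSE" plus
 * "10000". Punctuation and accent maps would be built the same way. */
static void normalize_term(const char *word, char *canon, char *caps)
{
    size_t i;

    for (i = 0; word[i] != '\0'; i++) {
        canon[i] = (char)toupper((unsigned char)word[i]);
        caps[i] = isupper((unsigned char)word[i]) ? '1' : '0';
    }
    canon[i] = '\0';
    caps[i] = '\0';
}
```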

The general database storage requires three or more 512MB stripes per table. One or more stripes hold 13 sector fragments. One or more stripes hold 21 sector fragments. One or more stripes hold the index of fragments. All rows within a table are stored in N-dimensional Peano curve format and therefore the table is its own universal index. Sets of eight rows are bit transposed, Peano mixed and arithmetic compressed into one stream. If a fragment exceeds 13 sectors, it is placed into a 21 sector fragment. If a fragment exceeds 21 sectors, it is placed into two 13 sector fragments. All CHAR and VARCHAR fields which are longer than 13 bytes are stored in shadow tables which require their own 512MB stripes. Each definition of VARCHAR(255) requires several gigabytes of space. BLOB fields are stored in an unindexed Fibonacci filing system.

If you doubt the wisdom of chopping data so finely then please investigate the sort utility used in a previous project.

03:35 PM
Software

(This is the 50th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

I have an idea which is based upon Google Search Cache and EMC storage units. If it is possible to de-duplicate files on a global scale then it may be possible to obtain a global scale search engine as a corollary. Unfortunately, my developments in this field have been amusingly dire. I've been unable to implement a de-duplicated filing system with write throughput exceeding 11KB/s. However, said filing system did not store a JPEG quantize table more than once. Nor did it store two copies of the same file in a PKZip archive.

A considerably simplified version may incur read fragmentation but would have vastly superior write and replication throughput. Even a trivial test system would scale beyond the affordability of many organizations. There are various techniques for filing system de-duplication. Some of these techniques even apply to memory de-duplication used in virtual desktop or virtual server systems.

The simplest technique is to not store any file with a duplicate checksum. Some legal and medical systems use this technique because it has the advantage of providing some tamper-proofing. If a file is modified then its checksum changes and therefore it is not the original file. Dropbox and MegaUpload used a similar technique and this explains the inability to maintain privileges within these systems. If you can obtain the checksum of a file by any means then revocation of access is contrary to HTTP caching.
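The whole-file version of the technique fits in a few lines. In this sketch, FNV-1a stands in for a real cryptographic digest and dedup_store is a made-up name:

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a: a stand-in for the cryptographic checksum a real system
 * would use (SHA-256 or similar). */
static uint64_t fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xCBF29CE484222325ULL;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001B3ULL;
    }
    return h;
}

/* Store a file only if its checksum is unseen. Returns 1 if the body
 * was (notionally) written, 0 if it de-duplicated against an earlier
 * copy. A fixed table stands in for the real checksum index. */
static uint64_t seen[1024];
static size_t n_seen;

static int dedup_store(const void *data, size_t len)
{
    uint64_t h = fnv1a64(data, len);

    for (size_t i = 0; i < n_seen; i++)
        if (seen[i] == h)
            return 0;           /* duplicate: store a reference only */
    seen[n_seen++] = h;         /* new content: write it             */
    return 1;
}
```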

Block-level de-duplication works on fixed boundaries; typically 2^n for suitably large n. Linux Volume Management defaults to 4MB blocks. Anything above 256MB is recommended to maintain video or database throughput. However, I discovered a weird result in which block size is immaterial. If block size is very large then it is indistinguishable from file de-duplication. If block size is very small then fragmentation may adversely affect read or write speed. However, in the general case, it does not matter.
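The fixed-boundary variant is the same idea applied per block. A sketch, assuming the hypothetical name count_unique_blocks and a toy 4 byte block size in place of a realistic 4KB or 4MB:

```c
#include <stddef.h>
#include <string.h>

/* Count the blocks that would actually be stored: a block is written
 * only when no earlier block in the buffer has identical content.
 * Real systems compare checksums; a toy memcmp() suffices here. */
static size_t count_unique_blocks(const unsigned char *buf, size_t len,
                                  size_t bsize)
{
    size_t blocks = len / bsize, unique = 0;

    for (size_t i = 0; i < blocks; i++) {
        size_t j;
        for (j = 0; j < i; j++)
            if (memcmp(buf + i * bsize, buf + j * bsize, bsize) == 0)
                break;          /* matches an earlier block */
        if (j == i)
            unique++;
    }
    return unique;
}
```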

Byte-level de-duplication may use proprietary algorithms or dedicated hardware to perform O(n^2) matching over 1KB of data or more. This is quite energy intensive but it provides the tightest matches. It also provides the most devastating failure modes because the system has minimal redundancy.

After considering byte-level de-duplication techniques, Huffman compression and various other storage techniques (in memory and on media), I believe that Fibonacci lengths should be seriously considered. If we go through the standard Fibonacci sequence (variants exist and may be useful), we have: 1, 1, 2, 3, 5, 8, 13, 21 and 34. My proposal is that files can be stored exclusively in 13 byte fragments and 21 byte fragments. Furthermore, every fragment can be given an eight byte handle where that handle contains or is augmented with a file server cluster number and fragment checksum. With five bytes or more of contiguous fragment numbers, a malicious user who is trying to exhaust all allocations would require 13*2^40 bytes of storage. This requires a 13TB file quota. With a less rigorous checksum or checksums external to the fragment reference, the required storage is significantly larger.

In typical usage, read and write operations will be large sequential operations with minor fragmentation for data which is more commonly stored. In the worst case, unique data incurs considerable overhead. Indeed, if you only have one or two copies of data then mis-alignment by one byte incurs a duplicate copy with additional fragmentation overhead. However, by random walks and birthday paradox, savings occur rapidly. Four or five copies are likely to achieve alignment. And 22 copies of data under any alignment must attain some savings. So, if you're in an office where 500 people get CCed then you can be certain that a maximum of 21 copies of the data will be stored on disk.

In a clustered environment, there is the possibility that two nodes discover the same fragment of data independently. In the general case, this is of no consequence. After replication, one reference takes precedence and the other falls into obsolescence. To ensure load balancing, precedence may be determined by checksum prior to node ID. Stripping out stale references is an interesting exercise in a clustered environment. Although, a general solution is likely to encourage wear-leveling.

However, the benefit is the ability to perform full-text indexing. Considering seven bit ASCII first, each 13 byte fragment and 21 byte fragment has an upper bound for the number of words which do not occur adjacent to a fragment boundary. Likewise, zero or one words span consecutive fragments. (Words which are longer than 13 bytes may be awkward.) Latin1, UTF-8 and UCS2 are more awkward, especially if the text contains long runs of Japanese or suchlike. Regardless, it is possible to index such text unambiguously after a file has been closed.

All of this can be dumped into a full-text index and searched in a manner which enforces file and directory privileges. It is possible (but not easy) to perform a database join in a manner which maintains partial ordering of resource IDs. It is also possible to perform this in a manner which allows significant fragment ID compression. This should scale beyond 1EB (2^60 bytes) but I may be optimistic.

If you doubt the wisdom of chopping data so finely then please investigate the sort utility used in a previous project.

03:29 PM
Hardware

(This is the 49th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

ARM is a mess. Skip details if too complicated. When I see "ARM Cortex", I translate this as "ARM, with the interrupt controller which is incompatible with every other variant of ARM". Indeed, when we say ARM, is that ARM with the 26 bit address space and dedicated zero register, ARM with the 32 bit address space (with or without Java hardware, with or without one of the Thumb modes, with or without one of the SIMD variants) or ARM with 64 bit registers? And that's before we get to ARM licensee variants. So, for example, Atmel ARM processors are configured very much like micro-controllers. I/O pins may be set in groups of 32 to be digital inputs or digital outputs (with optional integral pull-up or pull-down resistors) or over-ridden to provide other functions. Broadcom's BCM2835, used in some models of Raspberry Pi, has banks of 30 bit registers where I/O pins can be set in groups of 10 and potentially set to one of eight functions. I presume ARM processors from NXP have something completely different.

One of the most consistent aspects of ARM Cortex is that every manufacturer seems to offer 16 interrupt levels. No more. No less. In this regard, ARM Cortex is very similar to a TMS9900. This was a heavy-weight 3MHz mini-computer core which, entertainingly, was also the core of Speak & Spell, as used by E.T. to phone home. This also has similarities to the RCA1802 which is most famous for being used in satellites.

These designs come from a very different age where open-reel tape (digital and analog) was the norm and bubble memory was just around the corner. Can you imagine a system with 2KB RAM and 4MB of solid state storage? That almost happened. Indeed, retro-future fiction based on this scenario would be fantastic.

The RCA1802 architecture is of particular interest. They hadn't quite got the idioms for subroutines. Instead, there was a four bit field to determine which of the 16 × 16 bit general purpose registers would be the program counter. Calling a subroutine involved loading the subroutine address into a register and setting the field so that the register became the new program counter. In this case, execution continued from the specified address. Return was implemented by setting the field to the previous state. This has some merit because leaf subroutines typically use fewer registers. So, it isn't critical if general purpose registers become progressively filled with return addresses. Indeed, if you want to sort 12 × 16 bit values, an RCA1802 does the job better than a 68000 or an ARM processor in Thumb mode. However, in general, every subroutine has to know the correct bit pattern of the four bit field for successful return. This creates a strict hierarchy among subroutines and, in some cases, requires trampolines.

The TMS9900 architecture does many things in multiples of 16. It has a great instruction set which uses the 16 bit instruction space efficiently. It has 16 interrupt levels. And it keeps context switching fast by having 16 × 16 bit registers in main memory, accessed via a workspace pointer. Indeed, there are aspects of this design which seem to have influenced early iterations of SPARC and ARM.

If we mix features of the RCA1802 and the TMS9900, we have a system which responds to interrupts within one instruction cycle. Specifically, if we take a TMS9900 interrupt priority encoder and use it to select an RCA1802 register as a program counter then nested interrupts occur seamlessly and without delay.

To implement this on a contemporary processor, we would have 16 program counters. Where virtual memory is implemented, we probably want multiple memory maps and possibly a stack pointer within each memory map. However, if all program counters initialize to known addresses after reset, we do not require any read or write access to shadow program counters. Even if a processor is virtualizing itself, by the Popek and Goldberg virtualization requirements, it is possible to devise a scheme which never requires access to a shadow program counter.

In the worst case, we incur initial overhead of pushing general registers onto stack. (A sequence which may be nested as interrupt priority increases.) We also incur overhead of restoring general registers. However, rather than loading alternate program counters with an address from a dedicated register or always indirecting from a known memory location, we unconditionally jump one instruction before the interrupt handler and execute a (privileged) yield priority instruction. On a physical processor, this leaves the alternate program counter with the most useful address. On a virtual processor, privilege violation leads to inspection of the instruction and that indicates that the address should be stored elsewhere.

On the first interrupt of a given priority, we require a trampoline to the interrupt handler. But on all subsequent calls, we run the interrupt handler immediately. It typically starts by pushing registers onto a local or global stack but, beyond that, the state which is specific to interrupts may be nothing more than a shadow program counter for each interrupt priority level and a global interrupt priority field. No instructions are required to read the internal state of the interrupt system. One special instruction is required to exit interrupt and this is typical on many systems.

03:11 PM
Hardware

(This is the 48th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

Did the Chinese government invent an infinitely scalable processor switch fabric or are the figures from Chinese super-computers a complete fabrication? Indeed, if the Chinese super-computer switch fabric is so good, why does China continue to use Cisco switches for the Great Firewall Of China and why does China intend to use Cisco switches to connect all of its population?

The conventional super-computer design is to have 2^(3n) nodes in a three dimensional grid. For example, 4096 nodes in a 16×16×16 grid. Each node is connected to three uni-directional data loops. Ignoring network congestion, the round-trip time between any two nodes in a loop is constant. The round-trip time between any two nodes in a plane is constant. And the round-trip time between arbitrary nodes is constant. In this arrangement, nodes have direct access to each other's memory and it is fairly easy to implement a memory interface which provides equal access to three network links and a local node.
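The constant round-trip property of a uni-directional loop is just modular arithmetic: the hops from a to b plus the hops back always sum to the loop length. A quick sketch (ring_round_trip is a made-up name):

```c
/* On a uni-directional ring of n nodes, the hop count from a to b plus
 * the hop count back is always n (zero when a == b), so the loop
 * round-trip time is constant regardless of which pair is chosen. */
static int ring_round_trip(int n, int a, int b)
{
    int fwd = ((b - a) % n + n) % n;   /* hops from a to b */
    int back = ((a - b) % n + n) % n;  /* hops from b to a */
    return fwd + back;
}
```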

The rate of throughput is obscene. An Intel Xeon processor may have up to 32 cores and each core has two-way hyper-threading. Each thread may consume up to five bytes of instruction per clock cycle and the clock is 4GHz. That's a peak instruction stream execution rate of 1280GB/s per node. For a 4096 node cluster, that's 5PB/s. Memory addressing is also special. 12 or more bits of an address are required merely to specify node number. With 4GB RAM per node, take 32 bit addressing and add another 12 bits for physically addressable RAM.
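The arithmetic behind those figures, with decimal prefixes (so the cluster total comes out nearer 5.2PB/s):

```c
#include <stdint.h>

/* Peak instruction stream consumption, using the figures above:
 * 32 cores x 2 threads x 5 bytes/cycle x 4GHz, then x 4096 nodes. */
static uint64_t node_stream_rate(void)
{
    const uint64_t cores = 32, threads = 2, bytes_per_cycle = 5;
    const uint64_t hz = 4000000000ULL;                 /* 4GHz    */

    return cores * threads * bytes_per_cycle * hz;     /* bytes/s */
}

static uint64_t cluster_stream_rate(uint64_t nodes)
{
    return node_stream_rate() * nodes;
}
```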

This arrangement is highly symmetric, highly fragile and rapidly runs into scalability problems. And yet, China just adds a cabinet or two of processor nodes and always retains the world super-computer record. Or adds GPUs to a subset of nodes. (Where does the extra memory interface come from?) Is China using a partially connected, two local, six remote hypercube topology? Is there any known upper bound for this switch fabric which is at least five years old?

Assuming China's claims are true, it is possible to make a heterogeneous super-computer cluster with more than 16000 nodes and have at least 1Gb/s bandwidth per node without MTBF becoming a significant problem. Even the cross-sectional area of first level cache of 16000 MIPS processors creates a significant target for random bit errors. Likewise for ALUs, FPUs and GPUs.

I investigated redundancy for storage and processing. The results are disappointing because best practice is known but rarely followed. For storage, six parity nodes in a cluster is the preferred minimum. Four is rare and zero or one is the typical arrangement. For processing, anything beyond a back-check creates more problems than solutions. Best-of-three is great but it triples energy consumption and rate of processor failure. With storage, it is mirrored at the most abstract level and that may be on different continents. At the very worst, it will be on separate magnetic platters accessed via separate micro-controllers. With processing, redundancy is on the same chip on the same board with the same power supply.

So, for processing, best practice is a twin processor back-check, like the mainframes from the 1970s. For a super-computer cluster, every node should participate in a global check-point and every computation should be mirrored on another node. Errors in parallel computation propagate extremely fast and therefore if any node finds an error, all nodes must wind back to a previous check-point. Checks can be performed entirely in software but it is also possible for a processor with two-way hyper-threading, with two sets of registers and also two ALUs and two FPUs, to run critical tasks in lock-step and throw an exception when results don't match.
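The software version of the twin back-check can be sketched as follows; step and run_checked are made-up names and the deliberate glitch injection stands in for a real transient fault:

```c
#include <stdint.h>

static int glitch_at = -1;  /* step number at which to inject a fault */
static int step_no;

/* One step of a deterministic computation. A single-bit error can be
 * injected on demand to stand in for a real transient fault. */
static uint64_t step(uint64_t x)
{
    uint64_t y = x * 6364136223846793005ULL + 1442695040888963407ULL;

    if (step_no++ == glitch_at)
        y ^= 1;                              /* simulated bit flip */
    return y;
}

/* Twin back-check: run every step twice and only commit when the two
 * results agree; on mismatch, wind back to the last committed state
 * and retry, much as a lock-stepped processor pair would. */
static uint64_t run_checked(uint64_t state, int steps)
{
    for (int i = 0; i < steps; ) {
        uint64_t a = step(state);
        uint64_t b = step(state);

        if (a == b) {                        /* agreement: commit    */
            state = a;
            i++;
        }                                    /* else retry from here */
    }
    return state;
}
```

Running the same workload with and without an injected fault yields the same final state, which is the point of the rollback.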

Now that I've considered it in detail, it is apparent to me that Intel has never been keen on anything beyond two-way hyper-threading. I just assumed Intel was shipping dark threads to facilitate snooping. ("Hey! You've actually got eight-way hyper-threading but we use six to compress video of you fapping when you think your webcam is off.") But perhaps selected customers get a reliable mode and us plebs get the partial rejects without scalability.

02:55 PM
Hardware

(This is the 47th of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)

There have been numerous successful processor architectures over many decades but, after MIPS and Alpha fell by the wayside, I assumed we reached the point where it was x86 versus ARM. However, recent events led me to re-evaluate this situation. I've found two additional contenders. The sale of (British) ARM to (Japanese) Softbank may have over-shadowed the subsequent sale of (British) Imagination Technologies to a Chinese sovereign fund. Imagination Technologies have the rights to legacy MIPS architectures but they're most noteworthy for losing a contract because Apple intends to develop its own GPUs rather than license successors to the PowerVR architecture. The drop in value made Imagination Technologies a very worthwhile acquisition for the Chinese government, especially when Chinese (and Russian) super-computers have often been based around MIPS.

However, a third, overlooked Californian architecture could eclipse everything. We've often seen how awful, low-end architectures grow in capability, consume mid-range systems and, more recently, are clustered until they fulfill top-tier requirements. Well, that could be the Xtensa architecture. It is a 16 register, strict two-read, one-write, three-address machine with licensing for 512 bit implementation and numerous optional instructions. Unlike MIPS, ARM or, in particular, x86, Xtensa instruction decode is ridiculously simple while maintaining good code density. Specifically, the opcode determines if the instruction is two bytes or *three* bytes with no exceptions. Potentially, this can be determined from one (or maybe two) bits of the opcode. Presumably, this requires break-point instructions of two lengths. However, three byte instructions invariably exceed the code density of ARM and MIPS while having significantly simpler instruction decode.

At present, the Xtensa architecture is typically surrounded by an accretion of shit but the core CPU is sound. Xtensa processors are most commonly packaged by Espressif Systems in the ESP8266 which is then packaged in various tacky micro-controller boards with, for example, wireless networking of various protocols. Firmware is typically held in an external serial ROM and is, presumably, paged using base firmware which maintains an LRU cache. Handling of interrupts via paging is extremely slow and this has led to the ESP8266 processor being unreliable and a certain method to achieve network time-out. Regardless, this is packaged into IoT devices where making a network connection is a small miracle and security is severely neglected. But don't worry because there are plenty of consultancies who upsell services to cover these deficiencies!

If you dump everything from the serial ROM and outwards, there's an architecture which can out-compete the world's fastest super-computer. I can only approach this efficiency by describing multiple iterations of vaporware. I begin with a few general observations. Firstly, every instruction set is more than 2/3 full from the first iteration. An exception to this rule is the MOS Technology 6502 instruction set which was designed specifically to be an almost pin-compatible, smaller die, higher yield clone of Motorola's 6800 processor. (I would have been *extremely* unamused if I worked at Motorola during this period.) Secondly, it has been known for decades that approximately 1/2 of processor instructions are register moves and approximately 1/3 of processor instructions are conditional operations. Thirdly, it would be logical to assume that code density can be maximized by allocating instruction space in proportion to instruction usage.

An example of these observations is the Z80 instruction set. Ignoring prefix opcodes and alternate sets of registers, there are eight operand encodings. Eight bit opcodes with the top two bits set to 01 encode a move instruction. The remaining six bits of the opcode indicate a three bit destination field and a three bit source field. So, 01-011-100 specifies a move between registers and 1/4 of the instruction space is allocated to these operations. That's fairly optimal for various definitions of optimal. However, one of these eight encodings refers to memory via the HL register pair rather than a register, the encoding for a move from (HL) to (HL) is repurposed as HALT, and a move from a register to itself is effectively a no-op. So, somewhat fewer than the 64 encodings are typically useful. And the use of alternates and escapes greatly increases opcode size for the subset of total permitted moves. The Motorola 68000 architecture doubles the size of each opcode field and allows memory addressing modes as a source and/or destination but otherwise works in a similar manner. So, 0010-011-000-000-100 performs a similar move between 68000 data registers but with lower code density. From this, I wondered if it was possible to define an instruction set where the majority of the instruction set was prefix operators - even to the extent that all ALU operations would require a prefix. My interest in this architecture was renewed when the x86 VEX prefix was devised. In addition to providing extensible SIMD, it also upgraded x86 from a two-address machine to a three-address machine.
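Decoding that Z80 move quarter really is trivial, which is the point about proportional allocation. A sketch (decode_ld_rr is a made-up name; the 01-110-110 pattern is repurposed as HALT):

```c
#include <stdint.h>

/* Decode the Z80 register move quarter: 01 ddd sss. Returns nonzero
 * when the opcode is in that quarter, with destination and source
 * fields extracted. Field value 110 addresses memory via HL. */
static int decode_ld_rr(uint8_t op, int *dst, int *src)
{
    if ((op & 0xC0) != 0x40)
        return 0;              /* not in the move quarter */
    *dst = (op >> 3) & 7;
    *src = op & 7;
    return 1;
}
```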

So, it is certainly possible, and even economically viable, to make a micro-processor where the default instruction is a two-address move, one prefix allows a larger range of registers to be addressed, another prefix specifies an ALU operation and another prefix upscales all two-address instructions to three-address instructions. (Have we consumed the instruction space? Probably and then some.) However, there is something really inefficient about this arrangement. Prefixes can be specified in any order. The trivial implementation consumes one clock cycle per prefix while accumulating internal state. However, there is a more fundamental problem related to permutations. Two prefixes have two permutations and therefore have two places in the instruction space. Three prefixes have six permutations and therefore have six places in the instruction space. Four prefixes have 24 permutations and therefore have 24 places in the instruction space. For N prefixes, we lose approximately log2(N!) bits of instruction space through duplication. That's grossly inefficient.

I moved away from this idea and worked on quad-tree video codecs. From this, I found that it was possible to implement instructions where two or more fields specify operands and none of the fields specify an opcode. An example would be a 16 bit instruction divided into four equal size fields. If the top bit of a field is set then the field represents opcode (instruction). If the top bit of a field is clear then the field represents operand (register reference). From this, we have 16 sets of instructions where a large number of CISC instructions apply to zero, one or two registers and we have a lesser number of instructions which apply to three registers. We also have one implicit instruction which applies to four registers. We can tweak the ratio of registers and instructions. 12, 13 or 14 general purpose registers may be preferable. I continued working on this idea and found that the fields can be divided strictly into inputs and outputs. It is possible to make a comprehensive design which permits 16 bit registers, 32 bit registers or 64 bit registers and has no prefix instructions or internal state between instructions. Donald Knuth's designs and the Itanium architecture show that it is possible to discard a dedicated flag register. I also found that it was trivial to consider a super-scalar design because the outputs of one instruction are trivial to compare against the inputs of the next instruction. However, code density is terrible, especially when handling immediate values. At best, I could achieve six bits of constant per 16 bit instruction - and even this stretches the definition of no intermediate state.
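A sketch of how such a field-tagged instruction might be pulled apart, assuming the basic split of eight opcode fragments and eight registers per field (decode16 and the struct are made-up names):

```c
#include <stdint.h>

/* Hypothetical decode of the four-field format: each four bit field of
 * a 16 bit instruction is an opcode fragment when its top bit is set,
 * or a reference to one of eight registers when its top bit is clear. */
struct fields {
    int n_ops, n_regs;
    uint8_t ops[4];     /* opcode fragments, left-most field first   */
    uint8_t regs[4];    /* register references, left-most field first */
};

static struct fields decode16(uint16_t insn)
{
    struct fields d = { 0, 0, { 0 }, { 0 } };

    for (int i = 3; i >= 0; i--) {
        uint8_t f = (insn >> (4 * i)) & 0xF;

        if (f & 0x8)
            d.ops[d.n_ops++] = f & 0x7;
        else
            d.regs[d.n_regs++] = f;
    }
    return d;
}
```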

However, further work with codecs and networking found that it is possible to represent instructions of unlimited size using BER or similar. Effectively, the top bit of an eight bit field specifies if a further byte follows. This can be daisy-chained over a potentially unlimited number of bytes. From here, we view the instruction space as a binary fraction of arbitrary precision. (This works like many of the crypto-currencies where, potentially, coin ownership can be divided into arbitrarily small binary fractions.) A proportion of the instruction space may be privileged in a manner which allows a break-point instruction of arbitrary length. We may also have a fence operation which works in a similar manner. We may have addressing modes of arbitrary complexity. However, multiple reads and/or writes per opcode is very likely to be detrimental when used in conjunction with virtual memory. We also retain much of the prefix and super-scalar stuff.
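The daisy-chained length encoding is essentially the LEB128/varint flavour of BER: seven payload bits per byte with the top bit flagging continuation. A sketch:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode v as a chain of bytes: seven payload bits per byte, top bit
 * set on every byte except the last. Returns the byte count. */
static size_t varint_encode(uint64_t v, uint8_t *out)
{
    size_t n = 0;

    while (v >= 0x80) {
        out[n++] = (uint8_t)(v & 0x7F) | 0x80;  /* more bytes follow */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;                      /* final byte        */
    return n;
}

/* Decode the chain, reporting how many bytes were consumed. */
static uint64_t varint_decode(const uint8_t *in, size_t *used)
{
    uint64_t v = 0;
    int shift = 0;
    size_t n = 0;
    uint8_t b;

    do {
        b = in[n++];
        v |= (uint64_t)(b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);

    if (used)
        *used = n;
    return v;
}
```

Values below 128 occupy one byte and the chain can extend indefinitely, which is where the seven-bits-per-byte density cost comes from.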

We only get seven bits of instruction per byte and this significantly reduces code density. I considered disallowing one byte instructions and this may be necessary to compete with the Xtensa architecture. In either case, 1/4 of the instruction space provides move operations between general purpose registers. Other move operations may raise the proportion of move instructions closer to the optimal 1/2. For super-scalar implementation, one unit handles trap instructions, fence instructions and branches. This ensures that all of these operations are handled sequentially. Secondary units are able to handle all ALU and FPU operations. Tertiary units have limited functionality but it is conceivable that eight FPU multiplies may be in progress without having a vector pipeline or out-of-order execution and reconciliation.

If I had unlimited resources, I would investigate this thoroughly. If I immediately required an arbitrary design then I would strongly consider the Xtensa architecture. If I immediately required a chip on a board and didn't care about memory integrity then I'd consider an Arduino or Raspberry Pi. If a faster processor or ECC RAM is required then x86 remains the default choice. However, I'd be a fool to ignore ARM, MIPS, Xtensa or future possibilities.