I require one common routine which:-
Implementation contains further documentation:-
begin 644 transpose.c.gz
M'XL(`/-D<ED"`^V::V_;-A2&O_M7$"@PV$[2B(SF&DM3H.F&!5BR`6DS8!CZ
M@99IBR@EJB(5-UCZWW>HBW67+"]I"BQN69GBX<MS>'DL'/5XBLZY1E=4A_P+
M^A!27P52,?1'P$*JN?05FAZ/CJ=H_&Y"+/P*?7`9>@>W9:AYY+TTK::9^YKY
M2[9$*QFB!=='"^JON;]&BFF%Y`HQOG8UU$).!5(Z9-13+U-M*I1$D6*K2,3]
M`T%]&B)'>D'(E`(OD&:.Z_//$5/;(8T[UBMR`FHTU*E4>M-&BMZRV/(%7X%C
M*_3[^?E?HQ?PC?LLKJ#YZ`7XS%=)`!IQA9A_R\UX$(AVJ49Z.R$;+@22CA.%
M$"L*0NF`9^`K#1V7@W<Z`E?!2KLH9&NN-`NW\<E`<X\K$.5>()C'?)W.+4P,
M%6L90C</I)@1ON4PCUE7]L41T=),I&9>($,:WJ%;"I.X$$P=HOET;F8;><GZ
MY>[";*J?CK+Q$<+3F1U;9LZ]S-O(](24VE2AT9[B66MC-GRM3;LPF?`W@)7A
M3B1H*.Y,\+!KJ*_1QF4A*[KW]OH*=E;D+;()4*#=X=7;/Z]S2SQK];$R/]7I
M7TDAY`91L]4\N5W6HF=5!84^1Q`4^+\TFU["!,'$"NE\*OJG-\4&L\R@H>(5
M5AL:!/D"@S&CCINIF@%4M#A:\G@;)$.0*:D/$?<**`^-NK$P/9NUM_T;U?$4
M]Z@;BXIZ<N:#2&_G"EV]/T=MGTO3EM@='!T==9;4[AZ]@C*#\B,4&\H)%`(%
M0[&@/.L]ZSV27LPP`U/X.?)B6&SWN;$I?FKUK3_E3[W^$'97$G[F&NPNI%@6
M[7Z6&W\7O4NVT@_I7]WN:>>O.%^=>C=!2:\XGT6[Z_BQYO'6=\_YBY^J5N;W
M9,5#I=OV\N,SNUI_UNO6JS+KOS+Q?Z2W`[-;KY4S>)]RHE[/[!(>W*=<K=>K
M>@E7Z_6A_K7Y56-)S*_<KVH]LTOXE?>OUI_G[YO-7\YLQ1SI+[\]M-NN3ZV7
MP;3M^M1Z&;3:KD^ME\&T[?J$>CW0WLV?"W.0.ZY#]:YZ-)_]>_:OL']EI!\_
M$=)5ANK->LH^/.TJ0_7LGK(/3[O*4#W24_;A:5<9JF?UE'WX'&>59?Y60-!P
M#8\JVJ6^R9$64Z8;&<%#$CQ8A1N7"0_16\J%25CG^5EDDJTR,(GAMEQZ(1FY
MIMQ''O>Y1P5:,)^MN`8%AZ'%G6:("K[V35;7)"FIUF"=YT#ATYB,UF'$3![:
M1T+*`$5^*(4P>78*3UY2NQ!;FK9/`XZS\RPTOU.YNGE;0:,EEX?(HW=HPY6+
MM,Q2TQ"OHZF_CL-LG$45.2ZBE80S3O+>GR,(3-\A14W6NIQVQW:_R:S?9-YK
M0JQ^DTY?8.5@,B%JAR:O)(ZG:<\TYV[Z*I>O3(I=?5+)BY9%_`:CF,B.?',C
M62S8(]D$9M;Y*Q:ZH7?U?F:CQ'GX!-9=_4:WDB_SY9I[:C&.?`5[#*P<%Y9R
MZED3],_(B!=O"^M0X%/SPLF\;[K^Y=>9O7WA=/7V_6_6);(6U@X?#)^VZ\WE
M95GT`D2[.NPR8$WT)O6T2[AOT*HH;@[?&!=+=WM--`E_B$1_^#@)O]RM+M+=
M7A4EI?!Q^F=@O28:A[^?5FOX)`X_M=H:#ZP;T<(!$1+0&O_#@=,>_MN\@3V>
M?SQM-]+6H<:F'2`[%M:9=2JLUTDW^'9PD!Q!9,2$]1&:XUILC(TQ3HPQ?,N-
M,_/[LW'SL!//@O;I_$#@CZ]?@]34J$P2\:^CI/1XI*VS9)0?DF.?=-:X=/?B
MM.C]N-AT,[D?:PM&'QOQ`SR9VA-S"[]Y4[IUNH\[N-$=W.X.KKM#ZNZ0/=TA
MC>Z0=G=(W1U<=P?O[$[_ALDW1.HB#-2T+UXPH5@!_R>D&?]=X!S.]N&,'0[0
MX70<CK[A7!L.K2)LBARRRQSJIH\]A#ZDESXY<^QAS(G].$N^MF[F>67\[/"9
MH4ITRD_@0=(QM[@X+3H?GX-Q1>0B.X!)T)-)N4NNF/6LCG&9'>JR0'?PWP_A
M[.^+<'8WX4@CX>QD?4K/M)-QC7B=R,.S?N3U(ZX):4,1U@^@?N#T`Z9(#>7*
M,$<**2,E:6R'"AD"%;L7*O%P*5;(,*R0'"ND=0?9^V+%?G"LV#6LV/MCA3P(
M4W$O4W%_\+@<O#V8J;@<O+U;\-\/Q$@WQ.Q&B)'](=;&K79>M7.JG4\-7.KE
M42W3T,X1G&]@O#/_S5);[:?7VFXNTG1ZK7P#6SN>WHIBUK,ZQM#3B_=%E]5^
M>JT&K.#^X#M.KU5#5^,80T\OWA==);])2_`Y5DA_\*0<_+PQ^'DM^,(8Y>#G
<U?,:_Z_M\@6AD.DH]$]'7T>C?P&?0(9LUBX`````
`
end
The bit matrix transpose algorithm is its own inverse, so the test performs two transpose operations and expects to get the original input back. Test input uses a marching bit test, which is a more thorough version of a walking bit test:-
begin 644 transpose-test.c.gz
M'XL(`/QD<ED"`XV1P6[;,`R&[WX*SD$`Q342V6[1`G%V2"^[M+ODL&'+`-66
M$P*2;$C*L*'HNU>T/3?9>I@.$L6/XO^#6B6P10\/PEO\!3LKC.M:)^%S)ZWP
MV!K82>?AD[!&.@?)*EHEP.X7.<]N87>4<-\:UUJ/)[TD2AAUIZ26QCO0PE9'
M-`=X"B*>.C6M[2]Z4/238OM'<>I#&OPV+\!Y87V?G:&IU*F64#I?8[L\?KQ(
MV2!UD8NG_LLJIN>-J64#C]OMUV@6(C2RO\!=-).FQB:*T)`Y-(P"80]56AV%
MA20)\<\%/$<`)^/P8&0-/='7WZC%/M4W0[#^IT3Q5&6IRE-5K*-`M=1.>J:O
M4Y[2D\7Z+'GSE@S9,"^F\@U?J[S,PW9U-7@8248D*ZD^G&]PQ)PP'S$_QT"^
M5;;_L6%962K>6QB!U%7WFXP$@Y.]84WSO-/N*92<H5ZP(,%B%"PN!0&Z\$&^
M8?&<YU]BFI<J]F<=7J*_"[^;^/_$L6%D6W<TU."\M_UAP]_7;P0J$![8'-,Y
?+H)*_S_\'2_#2?L+?8>5_F0-H\IP?P5])VKL/@,`````
`
end
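The double-transpose check described above can be sketched roughly as follows. This is not the attached code: the 8x8 matrix size, the byte-per-row layout and the names `transpose8x8`, `round_trip_ok` and `marching_test` are all assumptions made for illustration. "Marching" is taken here to mean that bits are set one at a time and left set, then cleared one at a time, whereas a walking test would only ever have a single bit set.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: transpose an 8x8 bit matrix held as 8 row bytes.
   Bit c of in[r] becomes bit r of out[c]. */
static void transpose8x8(const uint8_t in[8], uint8_t out[8])
{
    memset(out, 0, 8);
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            if (in[r] & (1u << c))
                out[c] |= (uint8_t)(1u << r);
}

/* Returns 1 if transposing twice reproduces m, else 0. */
static int round_trip_ok(const uint8_t m[8])
{
    uint8_t t[8], back[8];
    transpose8x8(m, t);
    transpose8x8(t, back);
    return memcmp(m, back, 8) == 0;
}

/* Marching pattern (assumed meaning): pass 0 sets the 64 bits one at a
   time, keeping earlier bits set; pass 1 clears them one at a time.
   Returns the number of patterns that failed the round trip. */
static int marching_test(void)
{
    uint8_t m[8] = {0};
    int failures = 0;
    for (int pass = 0; pass < 2; pass++) {
        for (int bit = 0; bit < 64; bit++) {
            m[bit / 8] ^= (uint8_t)(1u << (bit % 8));
            failures += !round_trip_ok(m);
        }
    }
    return failures;
}
```

A round trip alone cannot catch every bug (a routine that does nothing also passes), so a test along these lines would probably want at least one check of a single transpose against a known answer as well.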
After more extensive testing, code requires something akin to:-
/* Pick a register width per target; GCC predefines __AVR__ and __arm__. */
#ifdef __AVR__
#define REG16 1
#endif
#ifdef __arm__
#define REG32 1
#endif
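One plausible way for macros like these to be consumed is to pick a working word type at compile time. The sketch below is a guess at the intent, not the attached code: `reg_t` and `REG_BITS` are hypothetical names, the fallback to `uint_fast32_t` is an assumption, and the target checks use the macros that GCC predefines (`__AVR__`, `__arm__`).

```c
#include <stdint.h>

/* Assumed mapping from target macros to register width, mirroring the
   REG16/REG32 idea above. */
#if defined(__AVR__)
#define REG16 1
#elif defined(__arm__)
#define REG32 1
#endif

/* reg_t is a hypothetical name for the working word type. */
#if defined(REG16)
typedef uint16_t reg_t;
#elif defined(REG32)
typedef uint32_t reg_t;
#else
typedef uint_fast32_t reg_t;   /* fallback: let the platform choose */
#endif

/* Width of the chosen type in bits, for loops that walk a word. */
#define REG_BITS ((unsigned)(sizeof(reg_t) * 8))
```

Keeping the target detection in one place means the rest of the code only needs to refer to `reg_t` and `REG_BITS`, and a new target only needs one more `#elif`.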
(Score: 1, Informative) by Anonymous Coward on Sunday July 30 2017, @05:35PM (2 children)
You know I said that about 20 years ago I started having some ideas? Well, I tried writing some code and ran out of spare time and stuff. It's still rotting away on the Interwebs [sourceforge.net] and does nothing much of anything particularly useful, but a couple of years back I shoehorned in a homemade C unit testing framework that I'd also been making up as I went along.
(Score: 2) by cafebabe on Thursday August 10 2017, @04:41AM (1 child)
I've reviewed your code fully. Structure is good. Build is good. Tests are sparse but arguably acceptable given the purpose of the code. I'm particularly impressed with the comments which are often able to express the purpose of a subroutine in one line. I understand that the purpose of the code was to get ahead of available compiler optimizations and I presume that code was at least twice as fast.
I don't want to overstate a position, but are you seeking work in this field and are you able to relocate?
1702845791×2
(Score: 0) by Anonymous Coward on Friday August 11 2017, @09:24AM
Thanks for your very kind words. When I started writing that code back in about 1999 I had an AMD K6-2/400 which had "3DNow!", which was single-precision FP SIMD. They were a year or two ahead of Intel, whose SSE came out later. The AMD stuff worked like the (integer) MMX instructions in that it reused the 387 FP registers. This meant the OS didn't have to be patched to support extra registers. It could also, downhill with a following wind, do two FP adds and two FP muls in parallel. I later upgraded to the 500MHz version of the CPU. The 387-style FP in the K6-2 wasn't as fast as the Intel equivalent, so it wasn't as good for legacy code or double precision. However, the 4x4 matrix multiply (my 3DNow! implementation) was many times faster than the equivalent C code compiled by gcc on the same machine. I also did a crude comparison of that routine compiled from the C implementation on a Sun Ultra 60 (450MHz), and my 3DNow! routine on my very cheap AMD CPU was about the same speed to within a few percent (might have been slightly faster).
Life got in the way and I ran out of time. It turned out to be more work than I'd initially imagined, so I kind of forgot about it.
What I never had time to do was to reorder the instructions to take advantage of the CPU pipeline and to do things like prefetching cache lines worth of data. To begin with I just wanted a reasonably comprehensive set of portable routines that worked before I began trying to optimise them. Many of those moves, adds and multiplies could be shuffled about to take advantage of parallelism on the CPU. However, as you know that is highly specific to the model of CPU, and it's probably a lot of work for the programmer :-)
In the years since, I've had several jobs in software and I've been very lucky to have been taught and mentored by one or two very competent people. For example, I have learned and practised TDD over several years, done a bit of C++ and Java, and finally found myself working in a very interesting industry.