(This is the 52nd of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
I posted example code to perform an 8×8 bit matrix transpose. This is intended for multiple applications, including driving multiple serial lines for a network protocol, driving multiple serial lines for I2C, driving multiple serial DACs, and bit-plane audio and video compression. For this reason, I want something which is fast, flexible and reliable. I don't want to keep coming back to it. I certainly don't want to break or fork it. Although the example code is supposed to be optimized for eight, 16, 32 and 64 bit processors, disassembly of 32 bit ARM GCC output found that the results were dismal. I am also disappointed by the amount of effort required to correct this deficiency.
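For concreteness, the best known formulation of this kernel (in the style of Hacker's Delight) packs the matrix into a 64 bit word and transposes it with three masked shift-and-XOR rounds. The sketch below is for reference and is not necessarily the posted example code:

#include <stdint.h>

/* Sketch of the well-known 8x8 bit transpose (Hacker's Delight
   style), with row i of the matrix held in byte i of a 64 bit
   word. Each round is a "delta swap" which exchanges blocks
   across the diagonal. Illustrative, not the posted code. */
uint64_t transpose8(uint64_t x)
{
    uint64_t t;

    /* Swap bits within each 2x2 block (distance 7). */
    t = (x ^ (x >> 7)) & 0x00AA00AA00AA00AAULL;
    x ^= t ^ (t << 7);

    /* Swap 2x2 blocks within each 4x4 block (distance 14). */
    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCULL;
    x ^= t ^ (t << 14);

    /* Swap the off-diagonal 4x4 blocks (distance 28). */
    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0ULL;
    x ^= t ^ (t << 28);

    return x;
}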
I'm using a failing x86 laptop as a dumb terminal for a Raspberry Pi which is off-line and has a fixed configuration for the duration of the development cycle. Therefore, my first use of objdump -d to disassemble GCC output was applied to 32 bit ARM code on a Raspberry Pi. And the result was horrible. There are hidden functions prior to main(). That is known. But do they have to be trampoline functions? Is this related to ARM Thumb mode? Who knows. The compiled code is also quite laborious with stacks and stack frames. With hindsight, this would become the focus of my optimization. I'm aware of inlining and a large number of other compiler optimization techniques. However, with more streamlined stack conventions, there would be less code and less overhead to amortize when performing many of these optimizations. I suppose this is part of the "Why the blazes did the compiler do that???" response to optimization.
Anyhow, I'm optimizing for code size on the basis that loops should already be unrolled and that every instruction takes one clock cycle to execute. Under those assumptions, smaller code means faster execution and less energy consumption. There are numerous cases where smaller code is contrary to faster execution, and numerous cases where instruction count is not linear with execution time. However, on this architecture, for this subroutine, it is possible to equate length with speed as a first approximation.
It helps if a desirable result is known. On 32 bit ARM, there is a four bit conditional execution field in every instruction. This allows ARM implementations to have a very simple execution unit and avoids instruction decode pipeline stalls when jumping over one or two instructions. It is faster and easier to take the hit of one wasted instruction slot. Unfortunately, the consequence of 32 bit instructions with a four bit conditional field is that code density is terrible.
Anyhow, one would assume that 32 bit ARM has a register bit test instruction or a method of performing an AND mask operation with a short immediate value. Being a three-address machine, it would probably have a method of doing this non-destructively. Although, with a four bit conditional field and a four bit pre-scale field, short immediate values may be mutually exclusive with a third register reference, which would be required for a non-destructive test.
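For what it's worth, here is a sketch of the codegen I was hoping for on one bit of the transpose. The assembly in the comments is my expectation, not verified compiler output:

/* One bit of a transpose, as it ought to compile on 32 bit ARM:
   the AND with a short immediate can become TST (a non-destructive
   test which only sets flags) and the update of "out" can be a
   conditionally executed ORR: two instructions and no branch. */
static unsigned copy_bit(unsigned in, unsigned out)
{
    if (in & 0x01u)   /* TST   r0, #1        */
        out |= 0x80u; /* ORRNE r1, r1, #0x80 */
    return out;
}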
Even if the compiler's input was the dumb technique for an 8×8 bit transpose, the output should be a sequence of 128 instructions, or maybe 192 instructions, plus some instructions to clear registers, marshal data into two of the eight 32 bit registers, plus all of that stack frame nonsense. So, what did the "optimized" code produce? About 262 instructions. At this point, I felt like the comedian, Doug Stanhope, when he looks in the mirror and says "Well, that ain't right."
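For clarity, the dumb technique is just 64 test-and-set operations, along the lines of the sketch below (a uint64_t formulation for brevity; my actual code works on char buffers). At roughly two instructions per bit, full unrolling gives the 128 instruction estimate, or 192 with explicit shifts:

#include <stdint.h>

/* The dumb technique: test every input bit and set the mirrored
   output bit. Unrolled, each bit is roughly a test plus a
   conditional set, hence about 128 instructions on 32 bit ARM. */
static uint64_t transpose8_dumb(uint64_t in)
{
    uint64_t out = 0;
    int i, j;

    for (i = 0; i < 8; i++)
        for (j = 0; j < 8; j++)
            if (in & (1ULL << (8 * i + j)))
                out |= 1ULL << (8 * j + i);
    return out;
}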
I (thankfully) don't read ARM assembly, but a large number of stack references and square brackets indicated that GCC documentation claims about register allocation are complete bunkum. Again, it helps if desirable operation is known. With 64 bits of input packed into two 32 bit registers, the same for output, plus one or two working registers, the majority of stack references would be avoided if five or six general purpose registers were available. However, the eight char pointers for input and eight char pointers for output, provided as function parameters, had taken precedence over the more frequent operations.
I really wanted to keep the input pointers because I wanted to use the 8×8 bit transpose as a building block for a 12×8 transpose or suchlike for 12 bit DACs. After a re-write (times four, due to premature optimization), output was performed via post-increment on one pointer. This improved code size but wasn't sufficient. Reluctantly, input is also obtained via post-increment of one pointer. This frees sufficient general purpose registers and the compiled code, minus stack overhead, is about 80 instructions.
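The reworked interface looks something like the sketch below, where transpose8() is the 64 bit kernel sketched earlier, both sides use post-increment on a single pointer, and the names are illustrative:

#include <stdint.h>

uint64_t transpose8(uint64_t x); /* the kernel sketched earlier */

/* Two post-incremented pointers instead of 16 pointer parameters.
   With the parameter count this low, the working set fits in
   general purpose registers and the stack traffic disappears. */
void transpose8x8(const uint8_t *in, uint8_t *out)
{
    uint64_t x = 0;
    int i;

    for (i = 0; i < 8; i++)        /* marshal eight input bytes  */
        x |= (uint64_t)*in++ << (8 * i);

    x = transpose8(x);

    for (i = 0; i < 8; i++)        /* unpack through one pointer */
        *out++ = (uint8_t)(x >> (8 * i));
}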
Unfortunately, I was less than halfway through this problem. After getting sensible output for 32 bit ARM instructions, I repeated tests on 16 bit Thumb instructions, as used in an Arduino Due, and AVR's 16 bit instruction words, as used on the majority of Arduino micro-controllers. After about three days, I had something which compiled reliably on x86, 32 bit ARM instructions, 16 bit ARM instructions and AVR. I have no reason to believe that it performs badly on 6502, Z80, 68000, PowerPC, MIPS or any mainstream architecture. However, I'm astounded that it took so long. I got to the point where the default command was functionally equivalent to:-
objdump -d a.out | wc -l
with the occasional back-check to make sure that I wasn't inadvertently optimizing an unrolled loop. (Yes, many hours of fun there.)
After understanding the compiler's dumbness in this particular process, I devised a method to implement the partial 8×8 bit transpose functions in a fairly efficient manner. Indeed, this can be reduced to a macro which defines a C function. The function has local byte buffers for input and output. Conceptually, the input buffer is cleared and then data is selectively copied into the local input buffer. A call is made to an 8×8 bit transpose and then data is selectively copied from the local output buffer. The compiler is able to unroll the known quantity of clear and copy operations to virtual slots on the data stack. It is also able to eliminate redundant clear operations. Most impressively, it is able to eliminate all operations involving dummy inputs and outputs. The partial bit transpose functions are proportionately smaller than the complete 8×8 bit transpose. Unfortunately, compilation is relatively slow, and this is worsened because it is trivial to define many variants of bit transpose function.
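A sketch of the macro, with illustrative names and a simplified rows/columns parameterization (the real variants may select arbitrary lanes); the kernel has to be visible for inlining before the dead stores can be eliminated:

#include <stdint.h>

void transpose8x8(const uint8_t *in, uint8_t *out); /* full kernel, assumed inlineable */

/* Define a partial transpose: clear a local input buffer, copy in
   only the live rows, run the full 8x8 transpose, then copy out
   only the live columns. With fixed trip counts, the compiler
   unrolls the copies, drops redundant clears and eliminates the
   dummy lanes entirely. */
#define DEFINE_PARTIAL_TRANSPOSE(name, rows, cols)        \
static void name(const uint8_t *in, uint8_t *out)         \
{                                                         \
    uint8_t ibuf[8] = {0};                                \
    uint8_t obuf[8];                                      \
    int i;                                                \
    for (i = 0; i < (rows); i++)                          \
        ibuf[i] = in[i];                                  \
    transpose8x8(ibuf, obuf);                             \
    for (i = 0; i < (cols); i++)                          \
        out[i] = obuf[i];                                 \
}

/* For example, a four row by eight column partial transpose: */
DEFINE_PARTIAL_TRANSPOSE(transpose4x8, 4, 8)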
So, it is possible to optimize bit transpose functions by minimizing parameter count and by creating data trampolines which are optimized away by a compiler.
Unfortunately, I had hoped that a full 8×8 bit transpose for ARM Thumb mode would compile to 40 instructions, with minimal escapes to longer instructions. The result was 80 instructions, mostly of the extended format. This is disappointing. At a minimum, this consumes twice the intended processing budget, and this assumes there is no execution penalty for mis-aligned instructions. However, I'm targeting a device which has twice the requested processing power. So, I've soaked up the available resources and possibly eaten an extra 40% into the processing budget. It may remain desirable to write assembly implementations for one or more architectures. However, I've found a reliable but tedious method to avert this situation in some cases.
(Score: 1) by khallow on Tuesday October 31 2017, @11:01PM (2 children)
l0*2+l1 -> (l0 << 1) + l1
and
NBBY/8 -> NBBY >> 3
Some of the loop testing would be a bit faster testing against constants instead of calculations (make the calculation first and put it in a temp var, then compare against the var in the loop), but I guess that stops being a problem with loop unrolling. I guess the multiplication and division by 1 was meant for commenting purposes, but you should get rid of it.
Also, you probably can reduce the use of temp vars like t0 and t1. For example, a quick trick, which should help, for swapping two variables without a temporary (it works for floats in C too, if you XOR their bit patterns) is:
first ^= second;
second ^= first;
first ^= second;
That's three XOR operations and the only moves might be to registers and back.
Also, I'm willing to make a wild guess that you can compute the transpose by reversing the bits of your ints first (logarithmic in the size of the int, via masking and swapping; there's probably something really slick out there), swapping the order of the ints, and then swapping the remaining out-of-place bits in blocks to where they're supposed to be. That might be faster than the partly quad-recursive approach you're using at the moment... or it might not.
There might be mileage to be gained from making ints into blocks rather than rows. For example, in four bits:
1 | 2
--------
3 | 4
For a larger 16 bit block, that might look like:
1 2 | 5 6
3 4 | 7 8
----------
11 10 | 13 14
9 8   | 15 16
Reversing the order of the bits does most of the work of the transpose, with only the diagonal blocks somewhat out of place (swap the diagonal blocks and then swap elements 1 and 4, 13 and 16). In particular, 64 bit ints require very little swapping with this approach (three rounds of swapping) after the reversal of bits happens in order to generate the transpose of an 8x8 bit matrix.
Then you can do things like logic multiplication of blocks as well, which allows for chaining of certain kinds of transformations in a way that better respects your int boundaries. In the 16 bit case, you're operating with the 16 bit int either on another block of 16 bits or on a 4 bit vector. That should simplify the math at higher levels, but it's more of a mess at the int block level. Not sure it's worth the bother.
(Score: 2) by cafebabe on Wednesday November 01 2017, @11:18AM
I hope there is a solution which produces better compiler output. I thought my first attempt was sufficient but I now have reservations about my second attempt.
I understand that the bit masks are from Donald Knuth's work on this problem and that they are derived from binary representations of fractions which recur in binary, like 1/7 or 1/9. Regardless, I'm astounded that I've spent about a week on this problem when a solution was probably derived in the 1960s and an optimized solution should be library code or easy to find. For example, a partial or full transpose was commonly required in Amiga games, such as the inevitable port of Doom.
Variable swaps may be converted to register exchanges on some processor architectures. (I'm concerned that a three step XOR may hinder register exchange in some circumstances, for example, swapping a register and a stack element.) Compiler optimization converts multiplications and divisions into bit shift operations without difficulty. Indeed, I have the opposite problem on AVR, where I can only perform division by 2^n. Anything else generates a stray symbol reference.
Regardless, we're still at the level where a compiler peephole-optimizes common bit hacks but generally fails with anything more significant. Also, a compiler won't use registers sensibly until registers exceed parameters, and it won't optimize a partial transpose until it can eliminate references to elements in a buffer in a parent subroutine. This is a specific case of a fixed number of iterations allowing a loop to be unrolled, and then a compiler breaking elements of a struct or buffer into separate variables for the purpose of allocating variables to registers or stack (via one of many algorithms). From this, double writes (clear then copy) can be simplified. Likewise for writes to nowhere. Forcing this action required a subroutine with dummy buffers.
Finally, I really like the idea of the 16 bit implementation being four blocks of 16 bits. It breaks the symmetry of the implementation but it makes the coarsest swap into a trivial operation.
(Score: 1) by khallow on Wednesday November 01 2017, @08:11PM
12 11 |
10 9 |