Bit Matrix Transpose, Part 2
(This is the 52nd of many promised articles which explain an idea in isolation. It is hoped that ideas may be adapted, linked together and implemented.)
I posted example code to perform an 8×8 bit matrix transpose. This is intended for multiple applications, including driving multiple serial lines for a network protocol, driving multiple serial lines for I2C, driving multiple serial DACs, and bit-plane audio and video compression. For this reason, I want something which is fast, flexible and reliable. I don't want to keep coming back to it. I certainly don't want to break or fork it. Although the example code is supposed to be optimized for 8 bit, 16 bit, 32 bit and 64 bit processors, disassembly of 32 bit ARM GCC output found that the results were dismal. I am also disappointed by the amount of effort required to correct this deficiency.
I'm using a failing x86 laptop as a dumb terminal for a Raspberry Pi which is off-line and has a fixed configuration for the duration of the development cycle. Therefore, my first use of objdump -d to disassemble GCC output was applied to 32 bit ARM code on a Raspberry Pi. And the result is horrible. There are hidden functions prior to main(). That is known. But do they have to be trampoline functions? Is this related to ARM Thumb mode? Who knows. The compiled code is also quite laborious with stacks and stack frames. With hindsight, this stack handling would become the focus of my optimization. I'm aware of inlining and a large number of other compiler optimization techniques. However, with more streamlined stack conventions, there would be less code and less overhead to amortize when performing many of these optimizations. I suppose this is part of the "Why the blazes did the compiler do that???" response to optimization.
Anyhow, I'm optimizing code size on the basis that the loops are unrolled and every instruction takes one clock cycle to execute. Therefore, smaller code means faster execution and less energy consumption. There are numerous cases where smaller code is contrary to faster execution. There are also numerous cases where instruction count is not linear with execution time. However, on this architecture, for this subroutine, it is possible to equate length with speed as a first approximation.
It helps if a desirable result is known. On 32 bit ARM, there is a four bit conditional execution field in every instruction. This allows ARM implementations to have a very simple execution unit and avoids instruction decode pipeline stalls when jumping over one or two instructions. It is faster and easier to take the hit of one wasted instruction slot. Unfortunately, the consequence of 32 bit instructions with a four bit conditional field is that code density is terrible.
Anyhow, it might be assumed that 32 bit ARM has a register bit test instruction or a method of performing an AND mask operation with a short immediate value. Being a three-address machine, it would probably have a method of doing this non-destructively. Although, with a four bit conditional field and a four bit rotate field for immediates, short immediate values may be mutually exclusive with a third register reference, which would be required for a non-destructive test.
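As it happens, classic 32 bit ARM does provide this: the TST instruction ANDs a register with an immediate (an eight bit value rotated by the four bit field) and writes only the condition flags, so no third register is consumed. Here is a minimal sketch of the pattern in C; the annotated assembly is roughly what GCC emits for classic ARM, though the exact output depends on compiler version and flags:

#include <stdint.h>

/* Move bit j of src to bit i of dst, testing src non-destructively.
 * With i and j known at compile time, GCC for classic ARM typically
 * compiles the body to something like:
 *   tst   r0, #1 << j      @ AND with immediate, flags only
 *   orrne r1, r1, #1 << i  @ conditionally executed, no branch
 */
static inline uint32_t move_bit(uint32_t src, uint32_t dst, int i, int j)
{
    if (src & (UINT32_C(1) << j))
        dst |= UINT32_C(1) << i;
    return dst;
}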
Even if the compiler's input was the dumb technique for an 8×8 bit transpose, the output should be a sequence of 128 instructions, or maybe 192 instructions, plus some instructions to clear registers, marshal data into two of the eight 32 bit registers, plus all of that stack frame nonsense. So, what did the "optimized" code produce? About 262 instructions. At this point, I felt like the comedian, Doug Stanhope, when he looks in the mirror and says "Well, that ain't right."
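For reference, here is a sketch of the dumb technique, with the bit convention I assume throughout: bit j of input byte i becomes bit i of output byte j. With both loops unrolled, the 64 copies of the inner test-and-set account for the expected 128 instructions, or 192 if each bit costs a third instruction:

#include <stdint.h>

/* Dumb 8x8 bit transpose: 64 independent test-and-set operations. */
void transpose8_dumb(const uint8_t in[8], uint8_t out[8])
{
    for (int j = 0; j < 8; j++)
        out[j] = 0;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            if (in[i] & (1u << j))
                out[j] |= (uint8_t)(1u << i);
}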
I (thankfully) don't read ARM assembly but a large number of stack references and square brackets indicated that GCC documentation claims about register allocation are complete bunkum. Again, it helps if desirable operation is known. With 64 bits of input packed into two 32 bit registers, and the same for output, plus one or two working registers, the majority of stack references would be avoided if five or six general purpose registers were available. However, eight char pointers for input and eight char pointers for output, provided as function parameters, had taken precedence over the more frequent operations.
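For comparison, the well-known routine from Hacker's Delight holds the whole job in the register file: pack the eight input bytes into two 32 bit words, exchange progressively larger blocks of bits with shift-and-mask steps, then unpack. This is a sketch of that technique, not the example code under discussion:

#include <stdint.h>

/* 8x8 bit transpose in two 32 bit registers, after Hacker's Delight.
 * The shift-and-mask steps swap 1, 2 then 4 bit blocks in place. */
void transpose8_packed(const uint8_t in[8], uint8_t out[8])
{
    uint32_t x, y, t;

    x = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
        ((uint32_t)in[2] << 8)  | in[3];
    y = ((uint32_t)in[4] << 24) | ((uint32_t)in[5] << 16) |
        ((uint32_t)in[6] << 8)  | in[7];

    t = (x ^ (x >> 7)) & 0x00AA00AA;   x = x ^ t ^ (t << 7);
    t = (y ^ (y >> 7)) & 0x00AA00AA;   y = y ^ t ^ (t << 7);

    t = (x ^ (x >> 14)) & 0x0000CCCC;  x = x ^ t ^ (t << 14);
    t = (y ^ (y >> 14)) & 0x0000CCCC;  y = y ^ t ^ (t << 14);

    t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F);
    y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F);
    x = t;

    out[0] = (uint8_t)(x >> 24);  out[1] = (uint8_t)(x >> 16);
    out[2] = (uint8_t)(x >> 8);   out[3] = (uint8_t)x;
    out[4] = (uint8_t)(y >> 24);  out[5] = (uint8_t)(y >> 16);
    out[6] = (uint8_t)(y >> 8);   out[7] = (uint8_t)y;
}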
I really wanted to keep the input pointers because I wanted to use the 8×8 bit transpose as a building block for a 12×8 transpose or suchlike for 12 bit DACs. After a rewrite (times four, due to premature optimization), output performed post-increment on one pointer. This improved code size but wasn't sufficient. Reluctantly, input is also obtained via post-increment of one pointer. This frees sufficient general purpose registers and the compiled code, minus stack overhead, is about 80 instructions.
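The shape of the change, sketched with hypothetical names. Under the 32 bit ARM calling convention, only the first four arguments arrive in registers, so the original sixteen pointer parameters guarantee stack traffic before any real work begins:

#include <stdint.h>

/* Before (hypothetical shape): one pointer per row and per column.
 * Twelve of the sixteen parameters are passed on the stack. */
void transpose8_scatter(const uint8_t *i0, const uint8_t *i1,
                        const uint8_t *i2, const uint8_t *i3,
                        const uint8_t *i4, const uint8_t *i5,
                        const uint8_t *i6, const uint8_t *i7,
                        uint8_t *o0, uint8_t *o1,
                        uint8_t *o2, uint8_t *o3,
                        uint8_t *o4, uint8_t *o5,
                        uint8_t *o6, uint8_t *o7);

/* After: one post-incremented pointer each for input and output,
 * leaving the rest of the register file for the packed 64 bits. */
void transpose8(const uint8_t *in, uint8_t *out);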
Unfortunately, I was less than halfway through this problem. After getting sensible output on 32 bit ARM, I repeated tests on 16 bit Thumb instructions, as used in an Arduino Due, and 16 bit AVR instructions, as used on the majority of Arduino micro-controllers. After about three days, I had something which compiled reliably on x86, 32 bit ARM instructions, 16 bit ARM instructions and AVR. I have no reason to believe that it performs badly on 6502, Z80, 68000, PowerPC, MIPS or any mainstream architecture. However, I'm astounded that it took so long. I got to the point where the default command was functionally equivalent to:-
objdump -d a.out | wc -l
with the occasional back-check to make sure that I wasn't inadvertently optimizing an unrolled loop. (Yes, many hours of fun there.)
After understanding the dumbness of a compiler in this particular process, I devised a method to implement the partial 8×8 bit transpose functions in a fairly efficient manner. Indeed, this can be reduced to a macro which defines a C function. The function has local byte buffers for input and output. Conceptually, the input buffer is cleared and then data is selectively copied into the local input buffer. A call is made to an 8×8 bit transpose and then data is selectively copied from the local output buffer. The compiler is able to unroll the known quantity of clear and copy operations to virtual slots on the data stack. It is also able to eliminate redundant clear operations. Most impressively, it is able to eliminate all operations involving dummy inputs and outputs. The partial bit transpose functions are proportionately smaller than the complete 8×8 bit transpose. Unfortunately, compilation is relatively slow, and this is worsened because it is trivial to define many variants of the bit transpose function.
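A sketch of the idea, with hypothetical macro and function names; it assumes an inlinable transpose8() along the lines of the packed version above, so that the compiler can see through the local buffers:

#include <stdint.h>

void transpose8(const uint8_t *in, uint8_t *out);  /* must be inlinable */

/* Hypothetical macro: defines a partial transpose with IN_ROWS live
 * input bytes and OUT_ROWS live output bytes. The local buffers are
 * the "data trampoline": once transpose8() is inlined, the compiler
 * unrolls the fixed-count copies, drops the redundant clears and
 * deletes every operation which only touches dummy rows. */
#define DEFINE_PARTIAL_TRANSPOSE(NAME, IN_ROWS, OUT_ROWS)      \
void NAME(const uint8_t *in, uint8_t *out)                     \
{                                                              \
    uint8_t tin[8] = {0};  /* cleared local input buffer */    \
    uint8_t tout[8];       /* local output buffer */           \
    for (int i = 0; i < (IN_ROWS); i++)                        \
        tin[i] = *in++;                                        \
    transpose8(tin, tout);                                     \
    for (int i = 0; i < (OUT_ROWS); i++)                       \
        *out++ = tout[i];                                      \
}

/* For example, four live inputs feeding all eight outputs: */
DEFINE_PARTIAL_TRANSPOSE(transpose4to8, 4, 8)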
So, it is possible to optimize bit transpose functions by minimizing parameter count and by creating data trampolines which are optimized away by a compiler.
I hoped that a full 8×8 bit transpose for ARM Thumb mode would compile to about 40 instructions, with minimal escapes to longer encodings. Unfortunately, the result was 80 instructions, mostly of the extended format. This is disappointing. At a minimum, this consumes twice the intended processing budget, and that assumes there is no execution penalty for misaligned instructions. However, I'm targeting a device which has twice the requested processing power. So, I've soaked available resources and possibly eaten an extra 40% into the processing budget. It may remain desirable to write assembly implementations for one or more architectures. However, I've found a reliable but tedious method to avert this situation in some cases.