Stories
Slash Boxes
Comments

SoylentNews is people

posted by Fnord666 on Thursday June 20 2019, @09:43AM   Printer-friendly
from the abusing-bits-for-fun-and...profit? dept.

Submitted via IRC for Bytram

Abusing A CPU's Adders To Optimize Bit Counting

If you like nitpicking around C code and generated assembly language — and we’ll admit that we do — you shouldn’t miss [Scaramanga’s] analysis of what’s known as Kernighan’s trick. If you haven’t heard of the trick, it’s a pretty efficient way of counting bits.

Like the Wheatstone bridge and a lot of other things, the Kernighan trick is misnamed. Brian Kernighan made it famous, but it was actually first published in 1960 and again in 1964 before he wrote about it in 1988. The naive way to count bits would be to scan through each bit position noting how many one bits you encounter, but the problem is, that takes a loop for each bit. A 64-bit word, then, takes 64 loops no matter what it contains. You can do slightly better by removing each bit you find and stopping when the word goes to zero, but that still could take 64 cycles if the last bit you test is set.

Using the trick, you make the observation that X & (X-1) will always clear the least significant bit of a word. Try a few examples:

   X      X-1   X&(X-1)
000100000000
001000010000
001100100010
101010011000
110010111000
111111101110

You can probably see where this is going. By computing X&(X-1) you clear a bit on each loop iteration and you only have to go through the number of bits that are actually set.

[...] If you like this sort of thing, be sure to check out [Sean Anderson’s] extensive list of bit hacks. It shows several different ways to count bits and do other common and uncommon tasks with different tradeoffs. For example, you could dedicate a 256-entry lookup table and do the whole thing with one loop per byte with great speed but bad memory utilization. Always a trade-off.


Original Submission

 
This discussion has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2) by VanessaE on Thursday June 20 2019, @07:28PM (1 child)

    by VanessaE (3396) <vanessa.e.dannenberg@gmail.com> on Thursday June 20 2019, @07:28PM (#858153) Journal

    In the 8-bit/6502 days, we would probably have used BASIC to pre-calculate a bit-counting array, just using whatever algorithm works, strongly preferring human-readability over speed, since it probably only has to be done once.

    For each pass through a 0-255 loop, count the number of bits that the loop index itself would have set, and store that count into the array at that index. For example, the value 123 (0x7B, %01111011) has six "1" bits, so entry #123 in the array gets set to 6.

    Write the completed array to a file, one byte per entry (so 256 bytes total).

    At assembly time, that table gets crunched together with all the other tables, code segments, images, etc. that make up the distribution executable.

    At runtime, the table would unpack to its final location in RAM, along with everything else. The code to count bits would use it thus:

    3    LDX value
    4    LDA counting_table,X
    2    CLC
    3    LDX value+1
    4    ADC counting_table,X
    3    LDX value+2
    4    ADC counting_table,X
    [...]

    For those who don't know the 6502: numbers on the left are CPU cycle counts, assuming the value to be counted is somewhere in Zero Page. Since they aren't prefixed with $, counting_table and value are pointers into RAM. By extension, "+1" alters a pointer, not the value found there.

    For an 8-bit value, you'd just stop at the LDA instruction, for 7 cycles. A 16-bit value would need the CLC/LDX/ADC trio after that, and would take 16 cycles to count. One LDX/ADC pair per additional byte, so a 64-bit number would take 58 cycles in total. Less than one cycle per bit seems pretty snappy to me.

    Of course, I doubt anyone tinkering with a 6502 uses such large numbers in a situation where the bits must be counted, but the above will work for any number of up to 31 bytes in length. I'm sure there's some number length where some other algorithm would be faster, to say nothing of modern CPU architectures.

    Starting Score:    1  point
    Karma-Bonus Modifier   +1  

    Total Score:   2  
  • (Score: 3, Interesting) by Rich on Thursday June 20 2019, @09:19PM

    by Rich (945) on Thursday June 20 2019, @09:19PM (#858223) Journal

    The most amazing bit juggling I've seen on the 6502 is the loader code on the Apple Disk II controller. It somehow manages to generate the mapping table from 8-bit disk nibbles to 6-bit data (according to tricky rules about consecutive zeroes and ones) and decode a sector with it in 256 bytes of slot ROM code. 256 bytes which it has also has to share with code to slot-independently start the drive, smoothly drive the head stepper to track 0, identify when sector 0 passes under the head, read its nibbles, and jump to the loaded code. That, and the PROM logic of the Woz-Machine, is, for its time, like alien knowledge sent down to make a disk drive work with the absolutely most low-end hardware around it. That advantage was as good as being able to print money, and it paid for the Mac development.