posted by Fnord666 on Thursday June 20 2019, @09:43AM   Printer-friendly
from the abusing-bits-for-fun-and...profit? dept.

Submitted via IRC for Bytram

Abusing A CPU's Adders To Optimize Bit Counting

If you like nitpicking around C code and generated assembly language — and we’ll admit that we do — you shouldn’t miss [Scaramanga’s] analysis of what’s known as Kernighan’s trick. If you haven’t heard of the trick, it’s a pretty efficient way of counting bits.

Like the Wheatstone bridge and a lot of other things, the Kernighan trick is misnamed. Brian Kernighan made it famous, but it was actually first published in 1960 and again in 1964 before he wrote about it in 1988. The naive way to count bits would be to scan through each bit position noting how many one bits you encounter, but the problem is that it takes a loop iteration for each bit. A 64-bit word, then, takes 64 iterations no matter what it contains. You can do slightly better by removing each bit you find and stopping when the word goes to zero, but that could still take 64 iterations if the last bit you test is set.
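The two naive approaches described above might be sketched in C like this (function names are ours, not from the article):

```c
#include <stdint.h>

/* Naive approach: examine every bit position. Always 64 iterations
   for a 64-bit word, no matter how many bits are actually set. */
static int popcount_naive(uint64_t x)
{
    int count = 0;
    for (int i = 0; i < 64; i++)
        count += (x >> i) & 1;
    return count;
}

/* Slightly better: shift bits off and stop once the word hits zero.
   Still up to 64 iterations if the most significant bit is set. */
static int popcount_early_exit(uint64_t x)
{
    int count = 0;
    while (x) {
        count += x & 1;
        x >>= 1;
    }
    return count;
}
```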

Using the trick, you make the observation that X & (X-1) will always clear the least significant bit of a word. Try a few examples:

     X      X-1    X & (X-1)
    0001    0000     0000
    0010    0001     0000
    0011    0010     0010
    1010    1001     1000
    1100    1011     1000
    1111    1110     1110

You can probably see where this is going. By computing X&(X-1) you clear a bit on each loop iteration and you only have to go through the number of bits that are actually set.
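Put together, the trick yields a loop that runs once per set bit rather than once per bit position. A minimal sketch (the function name is ours):

```c
#include <stdint.h>

/* Kernighan's trick: x & (x - 1) always clears the least significant
   set bit, so the loop body runs exactly once per set bit. */
static int popcount_kernighan(uint64_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;   /* drop the lowest set bit */
        count++;
    }
    return count;
}
```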

[...] If you like this sort of thing, be sure to check out [Sean Anderson’s] extensive list of bit hacks. It shows several different ways to count bits and do other common and uncommon tasks with different tradeoffs. For example, you could dedicate a 256-entry lookup table and do the whole thing with one loop per byte with great speed but bad memory utilization. Always a trade-off.
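The 256-entry table approach mentioned above trades 256 bytes of memory for one table access per byte of input. One possible sketch (names and table-initialization scheme are ours, not from [Sean Anderson's] list):

```c
#include <stdint.h>

/* 256-entry lookup table: pop_table[b] holds the number of set bits
   in byte value b. Initialized via the recurrence
   popcount(i) = popcount(i/2) + (i & 1). */
static unsigned char pop_table[256];

static void init_pop_table(void)
{
    for (int i = 0; i < 256; i++)
        pop_table[i] = (unsigned char)(pop_table[i / 2] + (i & 1));
}

/* One table access per byte: 8 accesses for a 64-bit word. */
static int popcount_table(uint64_t x)
{
    int count = 0;
    for (int i = 0; i < 8; i++) {
        count += pop_table[x & 0xFF];
        x >>= 8;
    }
    return count;
}
```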


Original Submission

 
This discussion has been archived. No new comments can be posted.
  • (Score: 1) by shrewdsheep on Thursday June 20 2019, @10:26AM (8 children)

    by shrewdsheep (5215) on Thursday June 20 2019, @10:26AM (#857843)

    I think this is a good solution. However, you would be on the order of ~16 cycles. If you do the trick from TFA and also apply it to the complement, you would be finished in *at most* 16 cycles for a 32-bit integer, assuming that the operations on the complement and the raw number run simultaneously on a super-scalar architecture.

    BTW, & 0xff at the very end should be faster (OK, the compiler will probably catch it anyway).

  • (Score: 0) by Anonymous Coward on Thursday June 20 2019, @10:42AM (2 children)

    by Anonymous Coward on Thursday June 20 2019, @10:42AM (#857847)

    & 0xFF would be an optimization for % 256, not % 255.

    • (Score: 1) by shrewdsheep on Thursday June 20 2019, @10:56AM (1 child)

      by shrewdsheep (5215) on Thursday June 20 2019, @10:56AM (#857850)

      I stand corrected. This implies that an (expensive) division is needed, making the formula arguably more elegant but less efficient.

      • (Score: 0) by Anonymous Coward on Thursday June 20 2019, @11:09AM

        by Anonymous Coward on Thursday June 20 2019, @11:09AM (#857853)

        There's some voodoo modern compilers will do to derive a division result through multiplication, but I'm not sure whether it would be applicable in this case.

  • (Score: 0) by Anonymous Coward on Thursday June 20 2019, @11:10AM (1 child)

    by Anonymous Coward on Thursday June 20 2019, @11:10AM (#857854)

    Not that good, as it contains a very slow division operation at the end, which not only has a high latency but ties up an arithmetic execution unit for the whole time. But this algorithm doesn't use any memory and is more consistent in speed than the Kernighan algorithm.

    The lookup table has a long wait if it misses the cache, but at least the CPU can do other things while it waits. It's probably the best choice if you need to count a lot of bits all at once, and you can expect the table to be in L1 most of the time. The thing is, if performance has any chance of mattering here, you probably do need to count a very large number of bits (or a small number very often, same thing). You could also count four bits at a time with only a 16-byte table, which might be better if you are very short on space.

    Overall, today's CPUs are very fast and it's unlikely that the choice of algorithm here would ever make a difference. The obvious algorithm and lookup tables are the only ones that are immediately understandable, and the lookup table is faster in situations where it matters, so I'd probably go with one of those almost always. And x86/ARM have instructions to do this specific thing anyway.
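On the instruction-support point: with GCC or Clang, the portable way to reach those native instructions is the compiler builtin, which lowers to POPCNT on x86 (when the target supports it, e.g. with -mpopcnt) or the corresponding ARM instruction, and otherwise falls back to library code:

```c
#include <stdint.h>

/* GCC/Clang builtin: compiles to a single popcount instruction where
   the target CPU has one, or a software fallback where it doesn't. */
static int popcount_builtin(uint64_t x)
{
    return (int)__builtin_popcountll(x);
}
```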

    But maybe you are not running on a powerful x86 or ARM, but instead on an attiny where you have a hard real time requirement and you can't spend 256, or even 16, bytes on a lookup table. You don't have a division instruction either. Maybe Kernighan's algorithm is best here.

    • (Score: 1, Informative) by Anonymous Coward on Thursday June 20 2019, @01:22PM

      by Anonymous Coward on Thursday June 20 2019, @01:22PM (#857894)

      No division is needed:

      https://godbolt.org/z/4PRXJw [godbolt.org]

  • (Score: 0) by Anonymous Coward on Thursday June 20 2019, @12:52PM (1 child)

    by Anonymous Coward on Thursday June 20 2019, @12:52PM (#857880)

    Excuse me, but how is & 0xff different from & 255, and why are you bringing up 256 aka 0x100?

    • (Score: 1, Informative) by Anonymous Coward on Thursday June 20 2019, @01:15PM

      by Anonymous Coward on Thursday June 20 2019, @01:15PM (#857889)

      Because AND is not the same operation as DIV.

      When you want the remainder modulo a power of two, you AND the value against that power of two minus 1. So N % 256 becomes N & (256 - 1), i.e. N & 0xFF. 255 is not a power of two, so N % 255 has no such shortcut.
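The identity is easy to check: for a power of two m, the low bits of n are exactly the remainder, so n % m == n & (m - 1), while the same mask does not reproduce n % 255. A small sketch (function name is ours):

```c
/* Remainder modulo a power of two via masking.
   Caller guarantees m is a power of two. */
static unsigned mod_pow2(unsigned n, unsigned m)
{
    return n & (m - 1);
}
```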

  • (Score: 2) by Immerman on Thursday June 20 2019, @08:34PM

    by Immerman (3985) on Thursday June 20 2019, @08:34PM (#858207)

    It has the advantage, though, that it avoids looping, or more to the point, conditional jumps. In a modern CPU with branch prediction, a mispredicted jump (presumably the final "exit" jump of the loop, if the code was well-optimized for the looping itself) forces the pipeline to be flushed, which costs an additional [pipeline depth] cycles.