How do I count the number of set bits in a 32-bit integer?

8 bits representing the number 7 look like this:

00000111

Three bits are set.

What is an algorithm to determine the number of set bits in a 32-bit integer?

This is known as the "Hamming weight", "popcount" or "sideways addition".

The "best" algorithm really depends on which CPU you are on and what your usage pattern is.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. The parallel instructions (like x86's popcnt, on CPUs where it's supported) will almost certainly be fastest. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed).

A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are doing lots of these operations in a tight loop. However, it can suffer because of the expense of a "cache miss", where the CPU has to fetch some of the table from main memory.

If you know that your bytes will be mostly 0's or mostly 1's then there are very efficient algorithms for these scenarios.
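For instance, here is a minimal sketch of the sparse/dense special cases (this is the same clear-lowest-set-bit trick that appears in several answers further down; the 32-bit width and the helper names are my own assumptions, not part of this answer):

 /* Counts in time proportional to the number of 1s: good when words are mostly 0s. */
 static int popcount_sparse(unsigned int x) {
     int count = 0;
     while (x) {
         x &= x - 1;   /* clear the lowest set bit */
         ++count;
     }
     return count;
 }

 /* For words that are mostly 1s, count the zeros of the complement instead.
    Assumes a 32-bit unsigned int. */
 static int popcount_dense(unsigned int x) {
     return 32 - popcount_sparse(~x);
 }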

I believe a very good general purpose algorithm is the following, known as the "parallel" or "variable-precision SWAR algorithm". I have expressed this in a C-like pseudo language; you may need to adjust it to work for a particular language (e.g. use uint32_t in C or C++, and >>> in Java):

 int numberOfSetBits(int i)
 {
     // Java: use >>> instead of >>
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
     return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
 }

This has the best worst-case behaviour of any of the algorithms discussed, so it will efficiently deal with any usage pattern or values you throw at it.


This bitwise SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later).

However, the best way to use vector instructions for popcount is usually by using a variable shuffle to do a table lookup of 4 bits at a time of each byte in parallel. (The 4 bits index a 16-entry table held in a vector register.)

On Intel CPUs, the hardware 64-bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false-dependency problem on Intel.

References:

https://graphics.stanford.edu/~seander/bithacks.html

https://en.wikipedia.org/wiki/Hamming_weight

http://gurmeet.net/puzzles/fast-bit-counting-routines/

http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count);

Also consider the built-in functions of your compilers.

On the GNU compiler, for example, you can just use:

 int __builtin_popcount (unsigned int x);
 int __builtin_popcountll (unsigned long long x);

In the worst case the compiler will generate a call to a function. In the best case the compiler will emit a CPU instruction to do the same job faster.

The GCC intrinsics even work across multiple platforms. Popcount will become mainstream in the x86 architecture, so it makes sense to start using the intrinsic now. Other architectures have had popcount for years.


On x86, you can tell the compiler that it can assume support for the popcnt instruction with -mpopcnt or -msse4.2, which also enables the vector instructions that were added in the same generation. See the GCC x86 options. -march=nehalem (or -march= whatever CPU you want your code to assume and to tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.

To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).
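As a concrete illustration (a minimal sketch of my own, not from the original answer; the flags are the GCC/clang options just mentioned):

 /* build e.g.:  gcc -O3 -mpopcnt demo.c      (or -march=native) */
 #include <stdio.h>

 int main(void)
 {
     unsigned int x = 7;
     /* With -mpopcnt (or -march=nehalem and later) GCC and clang typically
        compile this builtin to a single popcnt instruction; otherwise they
        fall back to a library helper or a bit-twiddling sequence. */
     printf("%d\n", __builtin_popcount(x));        /* prints 3 */
     printf("%d\n", __builtin_popcountll(~0ULL));  /* prints 64 */
     return 0;
 }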

MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc it really is an intrinsic for the hardware instruction and requires hardware support.
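A minimal sketch of what using the MSVC intrinsic might look like (my example, not from the original answer; because it maps directly to the hardware instruction, it will fault on CPUs without popcnt):

 #include <intrin.h>
 #include <stdio.h>

 int main(void)
 {
     unsigned int x = 7;
     printf("%u\n", __popcnt(x));   /* prints 3; requires a CPU with popcnt */
 #if defined(_M_X64)
     /* 64-bit variant, x64 only */
     printf("%u\n", (unsigned)__popcnt64(0xFFFFFFFFFFFFFFFFULL));  /* prints 64 */
 #endif
     return 0;
 }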


Using std::bitset<>::count() instead of a built-in

In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.

For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, although technically there is a separate feature bit for popcnt).

But at least you get something portable that works everywhere, and with gcc/clang and the right target options you get hardware popcount for architectures that support it.

 #include <bitset>
 #include <limits>
 #include <type_traits>

 template<typename T>
 //static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
 typename std::enable_if<std::is_integral<T>::value, unsigned>::type
 popcount(T x)
 {
     static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

     // sizeof(x)*CHAR_BIT
     constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
     // std::bitset constructor was only unsigned long before C++11.  Beware if porting to C++03
     static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

     typedef typename std::make_unsigned<T>::type UT;  // probably not needed, bitset width chops after sign-extension

     std::bitset<bitwidth> bs( static_cast<UT>(x) );
     return bs.count();
 }

See asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.

x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:

 unsigned test_short(short a) { return popcount(a); }
     movzx   eax, di      # note zero-extension, not sign-extension
     popcnt  rax, rax
     ret

 unsigned test_int(int a) { return popcount(a); }
     mov     eax, edi
     popcnt  rax, rax
     ret

 unsigned test_u64(unsigned long long a) { return popcount(a); }
     xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
     popcnt  rax, rdi
     ret

PowerPC64 gcc -O3 -std=gnu++11 emits (for the int arg version):

     rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
     popcntd 3,3         # popcount
     blr

This source isn't x86-specific or GNU-specific at all, but only compiles well for x86 with gcc/clang/icc.

Also note that gcc's fallback for architectures without single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.

In my opinion, the "best" solution is the one that another programmer (or the original programmer two years later) can read without copious comments. You may well want the fastest or cleverest solution which some have already provided, but I prefer readability over cleverness any time.

 unsigned int bitCount (unsigned int value) {
     unsigned int count = 0;
     while (value > 0) {           // until all bits are zero
         if ((value & 1) == 1)     // check lower bit
             count++;
         value >>= 1;              // shift bits, removing lower bit
     }
     return count;
 }

If you want more speed (assuming you document it well to help your successors), you could use a table lookup:

 // Lookup table for fast calculation of bits set in 8-bit unsigned char.
 static unsigned char oneBitsInUChar[] = {
 //   0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  (<- n)
 // =====================================================
      0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, // 0n
      1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // 1n
      : : :
      4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, // Fn
 };

 // Function for fast calculation of bits set in 16-bit unsigned short.
 unsigned char oneBitsInUShort (unsigned short x) {
     return oneBitsInUChar [x >>   8]
          + oneBitsInUChar [x & 0xff];
 }

 // Function for fast calculation of bits set in 32-bit unsigned int.
 unsigned char oneBitsInUInt (unsigned int x) {
     return oneBitsInUShort (x >>     16)
          + oneBitsInUShort (x &  0xffff);
 }

These do rely on specific data type sizes, so they're not that portable. But since many performance optimisations aren't portable anyway, that may not be an issue. If you want portability, I'd stick to the readable solution.

From Hacker's Delight, p. 66, Figure 5-2:

 int pop(unsigned x)
 {
     x = x - ((x >> 1) & 0x55555555);
     x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
     x = (x + (x >> 4)) & 0x0F0F0F0F;
     x = x + (x >> 8);
     x = x + (x >> 16);
     return x & 0x0000003F;
 }

Executes in ~20-ish instructions (architecture dependent), with no branching.

Hacker's Delight is delightful! Highly recommended.

I think the fastest way, without using lookup tables and popcount, is the following. It counts the set bits with just 12 operations.

 int popcount(int v) {
     v = v - ((v >> 1) & 0x55555555);                // put count of each 2 bits into those 2 bits
     v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // put count of each 4 bits into those 4 bits
     return ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
 }

This works because you can count the total number of set bits by dividing the number into two halves, counting the number of set bits in both halves, and then adding them up. This is also known as the Divide and Conquer paradigm. Let's get into detail...

 v = v - ((v >> 1) & 0x55555555); 

The number of set bits in a pair of two bits can be 0b00, 0b01 or 0b10. Let's work this out on two bits:

  ---------------------------------------------
  |   v    |  (v >> 1) & 0b0101  |    v - x    |
  ---------------------------------------------
    0b00            0b00              0b00
    0b01            0b00              0b01
    0b10            0b01              0b01
    0b11            0b01              0b10

This is what was required: the last column shows the count of set bits in every two-bit pair. If the two-bit value is >= 2 (0b10), then the AND produces 0b01, otherwise it produces 0b00.

 v = (v & 0x33333333) + ((v >> 2) & 0x33333333); 

This statement should be easy to understand. After the first operation we have the count of set bits in every two bits; now we sum up that count in every 4 bits.

 v & 0b00110011          // masks out the even two-bit pairs
 (v >> 2) & 0b00110011   // masks out the odd two-bit pairs

We then sum the above results, giving us the total count of set bits in every 4 bits. The last statement is the trickiest.

 c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; 

Let's break it down further...

 v + (v >> 4) 

This is similar to the second statement; we are counting the set bits in groups of 4 instead. We know, because of our previous operations, that every nibble holds the count of set bits in it. Let's look at an example. Suppose we have the byte 0b01000010. It means the first nibble has 4 bits set and the second one has 2 bits set. Now we add those nibbles together.

 0b01000010 + 0b00000100 

This gives us the count of set bits of the whole byte in its low nibble, 0b01000110, so we mask out the upper four bits of every byte in the number (discarding them).

 0b01000110 & 0x0F = 0b00000110 

Now every byte holds the count of its set bits. We need to add them all up together. The trick is to multiply the result by 0x01010101, which has an interesting property: if our number has four bytes A B C D, it will result in a new number whose bytes are A+B+C+D, B+C+D, C+D, D. A 4-byte number can have a maximum of 32 bits set, which can be represented as 0b00100000, so the per-byte sums cannot overflow.

All we need now is the first (most significant) byte, which holds the sum of all the set bits in all the bytes, and we get it with >> 24. This algorithm was designed for 32 bit words but can easily be modified for 64 bit words.
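To make the walkthrough concrete, here is a small self-contained check (my own sketch; popcount32 is just the 12-operation function from above, with the stray c removed and an unsigned argument so the shifts are well defined):

 #include <stdio.h>

 static int popcount32(unsigned v) {
     v = v - ((v >> 1) & 0x55555555);
     v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
     return (((v + (v >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
 }

 int main(void)
 {
     /* The multiply step: for bytes A B C D, v * 0x01010101 carries A+B+C+D in its top byte. */
     unsigned v = 0x02030104u;                  /* per-byte values 2, 3, 1, 4 */
     printf("%u\n", (v * 0x01010101u) >> 24);   /* prints 10 = 2+3+1+4 */

     printf("%d\n", popcount32(7));             /* prints 3 */
     printf("%d\n", popcount32(0xFFFFFFFFu));   /* prints 32 */
     return 0;
 }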

I got bored, and timed a billion iterations of three approaches. Compiler is gcc -O3. CPU is whatever they put in the 1st gen Macbook Pro.

Fastest was the following, at 3.7 seconds:

 static unsigned char wordbits[65536] = { bitcounts of ints between 0 and 65535 };
 static int popcount( unsigned int i )
 {
     return( wordbits[i&0xFFFF] + wordbits[i>>16] );
 }

Second place goes to the same code but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.

Third place goes to the bit-twiddling "sideways addition" approach, which took 8.6 seconds.

Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.

Counting one bit at a time was waaaay slower, and I got bored of waiting for it to complete.

So if you care about performance above all else then use the first approach. If you care, but not enough to spend 64Kb of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.

It's hard to think of a situation where you'd want to use the bit-twiddling approach.

Edit: similar results here.

If you happen to be using Java, the built-in method Integer.bitCount will do that.

This is one of those questions where it helps to know your micro-architecture. I just timed two variants under gcc 4.3.3 compiled with -O3, using C++ inlines to eliminate function call overhead, one billion iterations, keeping the running sum of all counts to ensure the compiler doesn't remove anything important, and using rdtsc for timing (clock-cycle precise).

 inline int pop2(unsigned x, unsigned y)
 {
     x = x - ((x >> 1) & 0x55555555);
     y = y - ((y >> 1) & 0x55555555);
     x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
     y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
     x = (x + (x >> 4)) & 0x0F0F0F0F;
     y = (y + (y >> 4)) & 0x0F0F0F0F;
     x = x + (x >> 8);
     y = y + (y >> 8);
     x = x + (x >> 16);
     y = y + (y >> 16);
     return (x + y) & 0x000000FF;
 }

The unmodified Hacker's Delight version took 12.2 gigacycles. My parallel version (counting twice as many bits) runs in 13.0 gigacycles. That's 10.5 seconds total elapsed on a 2.4GHz Core Duo. 25 gigacycles = just over 10 seconds at this clock frequency, so I'm confident my timings are right.

This has to do with instruction dependency chains, which are very bad for this algorithm. I could nearly double the speed again by using a pair of 64-bit registers. In fact, if I was clever and added x+y a little sooner I could shave off some shifts. The 64-bit version with some small tweaks would come out about even while counting twice as many bits again.

With 128-bit SIMD registers, yet another factor of two, and the SSE instruction sets often have clever short-cuts, too.

There's no reason for the code to be especially transparent. The interface is simple, the algorithm can be referenced on-line in many places, and it's amenable to comprehensive unit testing. The programmer who stumbles upon it might even learn something. These bit operations are extremely natural at the machine level.

OK, I decided to bench the tweaked 64-bit version. For this one, sizeof(unsigned long) == 8.

 inline int pop2(unsigned long x, unsigned long y)
 {
     x = x - ((x >> 1) & 0x5555555555555555);
     y = y - ((y >> 1) & 0x5555555555555555);
     x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
     y = (y & 0x3333333333333333) + ((y >> 2) & 0x3333333333333333);
     x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F;
     y = (y + (y >> 4)) & 0x0F0F0F0F0F0F0F0F;
     x = x + y;
     x = x + (x >> 8);
     x = x + (x >> 16);
     x = x + (x >> 32);
     return x & 0xFF;
 }

That looks about right (I'm not testing carefully, though). Now the timings come out at 10.70 gigacycles / 14.1 gigacycles. That later number summed 128 billion bits and corresponds to 5.9 seconds elapsed on this machine. The non-parallel version speeds up a tiny bit because I'm running in 64-bit mode and it likes 64-bit registers slightly better than 32-bit registers.

Let's see if there's a bit more OOO pipelining to be had here. This was a bit more involved, so I actually tested it. Each term alone sums to 64, all combined sum to 256.

 inline int pop4(unsigned long x, unsigned long y,
                 unsigned long u, unsigned long v)
 {
     enum { m1 = 0x5555555555555555,
            m2 = 0x3333333333333333,
            m3 = 0x0F0F0F0F0F0F0F0F,
            m4 = 0x000000FF000000FF };

     x = x - ((x >> 1) & m1);
     y = y - ((y >> 1) & m1);
     u = u - ((u >> 1) & m1);
     v = v - ((v >> 1) & m1);
     x = (x & m2) + ((x >> 2) & m2);
     y = (y & m2) + ((y >> 2) & m2);
     u = (u & m2) + ((u >> 2) & m2);
     v = (v & m2) + ((v >> 2) & m2);
     x = x + y;
     u = u + v;
     x = (x & m3) + ((x >> 4) & m3);
     u = (u & m3) + ((u >> 4) & m3);
     x = x + u;
     x = x + (x >> 8);
     x = x + (x >> 16);
     x = x & m4;
     x = x + (x >> 32);
     return x & 0x000001FF;
 }

I was excited for a moment, but it turns out gcc is playing inlining tricks with -O3 even though I'm not using the inline keyword in some tests. When I keep gcc from playing tricks, a billion calls to pop4() take 12.56 gigacycles, but I determined it was folding arguments as constant expressions. A more realistic number appears to be 19.6gc for another 30% speedup. My test loop now looks like this, making sure each argument is different enough to stop gcc from playing tricks.

    hitime b4 = rdtsc();
    for (unsigned long i = 10L * 1000*1000*1000; i < 11L * 1000*1000*1000; ++i)
        sum += pop4 (i,  i^1, ~i, i|1);
    hitime e4 = rdtsc();

That's 256 billion bits summed in 8.17 seconds elapsed, which works out to 1.02 seconds for 32 billion bits, the workload benchmarked with the 16-bit table lookup. Can't compare directly, because the other bench doesn't give a clock speed, but it looks like I've slapped the snot out of the 64KB table edition, which is a tragic use of L1 cache in the first place.

Update: decided to do the obvious and create pop6() by adding four more duplicated lines. It came out to 22.8gc, 384 billion bits summed in 9.5 seconds elapsed. So there's another 20%, now at about 800 ms for 32 billion bits.

 unsigned int count_bit(unsigned int x)
 {
   x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
   x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
   x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF);
   return x;
 }

Let me explain this algorithm.

This algorithm is based on Divide and Conquer. Suppose there is an 8-bit integer 213 (11010101 in binary); the algorithm works like this (merging two neighbouring blocks each time):

 +-------------------------------+
 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |  <- x
 |  1 0  |  0 1  |  0 1  |  0 1  |  <- first time merge
 |    0 0 1 1    |    0 0 1 0    |  <- second time merge
 |        0 0 0 0 0 1 0 1        |  <- third time ( answer = 00000101 = 5)
 +-------------------------------+

Why not iteratively divide by 2?

 count = 0
 while n > 0
   if (n % 2) == 1
     count += 1
   n /= 2

I agree that this is not the fastest, but "best" is somewhat ambiguous. I would argue, though, that "best" should have an element of clarity.

For a happy medium between a 2^32 lookup table and iterating through each bit individually:

 int bitcount(unsigned int num){
     int count = 0;
     static int nibblebits[] =
         {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
     for(; num != 0; num >>= 4)
         count += nibblebits[num & 0x0f];
     return count;
 }

http://ctips.pbwiki.com/CountBits

The Hacker's Delight bit-twiddling becomes much clearer when you write out the bit patterns.

 unsigned int bitCount(unsigned int x)
 {
   x = ((x >> 1)  & 0b01010101010101010101010101010101)
      + (x        & 0b01010101010101010101010101010101);
   x = ((x >> 2)  & 0b00110011001100110011001100110011)
      + (x        & 0b00110011001100110011001100110011);
   x = ((x >> 4)  & 0b00001111000011110000111100001111)
      + (x        & 0b00001111000011110000111100001111);
   x = ((x >> 8)  & 0b00000000111111110000000011111111)
      + (x        & 0b00000000111111110000000011111111);
   x = ((x >> 16) & 0b00000000000000001111111111111111)
      + (x        & 0b00000000000000001111111111111111);
   return x;
 }

The first step adds the even bits to the odd bits, producing a sum of bits in each pair of two. The other steps add high-order chunks to low-order chunks, doubling the chunk size all the way up, until we have the final count taking up the entire int.

This is not the fastest or best solution, but I found the same question in my path, and I started to think and think. Finally I realized that it can be done like this if you look at the problem from a mathematical point of view and draw a graph: you will find that it's a function which has some periodic part, and then you realize the difference between the periods... so here you go:

 unsigned int f(unsigned int x)
 {
     switch (x) {
         case 0:  return 0;
         case 1:  return 1;
         case 2:  return 1;
         case 3:  return 2;
         default: return f(x/4) + f(x%4);
     }
 }

This can be done in O(k), where k is the number of bits set.

 int NumberOfSetBits(int n)
 {
     int count = 0;
     while (n) {
         ++count;
         n = (n - 1) & n;
     }
     return count;
 }

The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp11-12 (although there was a brief reference in Volume 2, 4.6.3-(7).)

The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM , Volume 3 (1960) Number 5, page 322 . He gives two different algorithms there, one optimized for numbers expected to be "sparse" (ie, have a small number of ones) and one for the opposite case.

A few open questions:

  1. What if the number is negative?
  2. If the number is 1024, then the "iteratively divide by 2" method will iterate 10 times.

We can modify the algorithm to support negative numbers as follows:

 count = 0
 while n != 0
   if ((n % 2) == 1 || (n % 2) == -1)
     count += 1
   n /= 2
 return count

Now, to overcome the second problem, we can write the algorithm like this:

 int bit_count(int num)
 {
     int count = 0;
     while (num) {
         num = num & (num - 1);
         count++;
     }
     return count;
 }

For a complete reference see:

http://goursaha.freeoda.com/Miscellaneous/IntegerBitCount.html

What do you mean by "best algorithm"? The shortest code or the fastest code? Your code looks very elegant and has a constant execution time. The code is also very short.

But if speed is the major factor and not the code size, then I think the following can be faster:

 static final int[] BIT_COUNT = { 0, 1, 1, ... 256 values with a bitsize of a byte ... };

 static int bitCountOfByte( int value ){
     return BIT_COUNT[ value & 0xFF ];
 }

 static int bitCountOfInt( int value ){
     return bitCountOfByte( value )
          + bitCountOfByte( value >> 8 )
          + bitCountOfByte( value >> 16 )
          + bitCountOfByte( value >> 24 );
 }

I don't think this will be faster for a 64-bit value, but for a 32-bit value it can be faster.

I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same amount of cycles.) It assumes that small constants (such as 6, 12, 24) cost nothing to load into the registers, or are stored in temporaries and reused over and over again.

With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so log_2(32) = 5, 5 x 3 = 15 instructions is a quasi-lowerbound.

 #define BitCount(X,Y)                                               \
     Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111);  \
     Y = ((Y + (Y >> 3)) & 030707070707);                            \
     Y = (Y + (Y >> 6));                                             \
     Y = (Y + (Y >> 12) + (Y >> 24)) & 077;

Here is a secret to the first and most complex step:

 input  output
 AB     CD     Note
 00     00     = AB
 01     01     = AB
 10     01     = AB - (A >> 1) & 0x1
 11     10     = AB - (A >> 1) & 0x1

so if I take the 1st column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.
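For instance, the 3-bit identity behind the macro's first step can be checked mechanically; this little test is my own addition and is not part of the original answer (it just verifies popcount(x) == x - ((x >> 1) & 3) - ((x >> 2) & 1) for every 3-bit x):

 #include <assert.h>
 #include <stdio.h>

 int main(void)
 {
     unsigned x;
     for (x = 0; x < 8; ++x) {
         unsigned expect = (x & 1) + ((x >> 1) & 1) + ((x >> 2) & 1);
         unsigned got    = x - ((x >> 1) & 3) - ((x >> 2) & 1);
         printf("%u -> %u\n", x, got);
         assert(got == expect);
     }
     return 0;
 }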

  • Don Gillies

if you're using C++ another option is to use template metaprogramming:

 // recursive template to sum bits in an int
 template <int BITS>
 int countBits(int val) {
     // return the least significant bit plus the result of calling ourselves with
     // .. the shifted value
     return (val & 0x1) + countBits<BITS-1>(val >> 1);
 }

 // template specialisation to terminate the recursion when there's only one bit left
 template<>
 int countBits<1>(int val) {
     return val & 0x1;
 }

usage would be:

 // to count bits in a byte/char (this returns 8)
 countBits<8>( 255 )

 // another byte (this returns 7)
 countBits<8>( 254 )

 // counting bits in a word/short (this returns 1)
 countBits<16>( 256 )

you could of course further expand this template to use different types (even auto-detecting bit size) but I've kept it simple for clarity.

edit: forgot to mention this is good because it should work in any C++ compiler and it basically just unrolls your loop for you if a constant value is used for the bit count (in other words, I'm pretty sure it's the fastest general method you'll find)

I think Brian Kernighan's method will be useful too... It goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go through the loop once.

 int countSetBits(unsigned int n)
 {
     // count the number of bits set in n
     unsigned int c;     // c accumulates the total bits set in n
     for (c = 0; n > 0; n = n & (n - 1))
         c++;
     return c;
 }

Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to me that this method "was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"

I'm particularly fond of this example from the fortune file:

#define BITCOUNT(x) (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x) ((x) - (((x)>>1)&0x77777777)
                             - (((x)>>2)&0x33333333)
                             - (((x)>>3)&0x11111111))

I like it best because it's so pretty!

Java JDK1.5

Integer.bitCount(n);

where n is the number whose 1's are to be counted.

check also,

 Integer.highestOneBit(n);
 Integer.lowestOneBit(n);
 Integer.numberOfLeadingZeros(n);
 Integer.numberOfTrailingZeros(n);

 //Beginning with the value 1, rotate left 16 times
 n = 1;
 for (int i = 0; i < 16; i++) {
     n = Integer.rotateLeft(n, 1);
     System.out.println(n);
 }

I found an implementation of bit counting in an array using SIMD instructions (SSSE3 and AVX2). It has 2-2.5 times better performance than using the __popcnt64 intrinsic function.

SSSE3 version:

 #include <smmintrin.h>
 #include <stdint.h>

 const __m128i Z = _mm_set1_epi8(0x0);
 const __m128i F = _mm_set1_epi8(0xF);
 // Vector with pre-calculated bit count:
 const __m128i T = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

 uint64_t BitCount(const uint8_t * src, size_t size)
 {
     __m128i _sum = _mm_setzero_si128();
     for (size_t i = 0; i < size; i += 16)
     {
         //load 16-byte vector
         __m128i _src = _mm_loadu_si128((__m128i*)(src + i));
         //get low 4 bit for every byte in vector
         __m128i lo = _mm_and_si128(_src, F);
         //sum precalculated value from T
         _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, lo)));
         //get high 4 bit for every byte in vector
         __m128i hi = _mm_and_si128(_mm_srli_epi16(_src, 4), F);
         //sum precalculated value from T
         _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, hi)));
     }
     uint64_t sum[2];
     _mm_storeu_si128((__m128i*)sum, _sum);
     return sum[0] + sum[1];
 }

AVX2 version:

 #include <immintrin.h>
 #include <stdint.h>

 const __m256i Z = _mm256_set1_epi8(0x0);
 const __m256i F = _mm256_set1_epi8(0xF);
 // Vector with pre-calculated bit count:
 const __m256i T = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
                                    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

 uint64_t BitCount(const uint8_t * src, size_t size)
 {
     __m256i _sum = _mm256_setzero_si256();
     for (size_t i = 0; i < size; i += 32)
     {
         //load 32-byte vector
         __m256i _src = _mm256_loadu_si256((__m256i*)(src + i));
         //get low 4 bit for every byte in vector
         __m256i lo = _mm256_and_si256(_src, F);
         //sum precalculated value from T
         _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, lo)));
         //get high 4 bit for every byte in vector
         __m256i hi = _mm256_and_si256(_mm256_srli_epi16(_src, 4), F);
         //sum precalculated value from T
         _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, hi)));
     }
     uint64_t sum[4];
     _mm256_storeu_si256((__m256i*)sum, _sum);
     return sum[0] + sum[1] + sum[2] + sum[3];
 }

I use the below code which is more intuitive.

 int countSetBits(int n)
 {
     return !n ? 0 : 1 + countSetBits(n & (n-1));
 }

Logic : n & (n-1) resets the last set bit of n.

PS: I know this is not an O(1) solution, albeit an interesting one.

There are many algorithms to count the set bits, but I think the best one is the faster one! You can see the details on this page:

Bit Twiddling Hacks

I suggest this one:

Counting bits set in 14, 24, or 32-bit words using 64-bit instructions

 unsigned int v; // count the number of bits set in v
 unsigned int c; // c accumulates the total bits set in v

 // option 1, for at most 14-bit values in v:
 c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;

 // option 2, for at most 24-bit values in v:
 c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
 c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

 // option 3, for at most 32-bit values in v:
 c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
 c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
 c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 10; and the third option takes 15.
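If you want to try it, here is one way option 3 might be wrapped up for testing (the wrapper and its name are my own, not from the linked page):

 #include <stdio.h>

 /* Option 3 from above, wrapped in a function; handles full 32-bit values. */
 static unsigned popcount_mod(unsigned v)
 {
     unsigned c;
     c  = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
     c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
     c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
     return c;
 }

 int main(void)
 {
     printf("%u\n", popcount_mod(7));            /* prints 3 */
     printf("%u\n", popcount_mod(0xFFFFFFFFu));  /* prints 32 */
     return 0;
 }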

Here is a portable module ( ANSI-C ) which can benchmark each of your algorithms on any architecture.

Your CPU has 9 bit bytes? No problem 🙂 At the moment it implements 2 algorithms, the K&R algorithm and a byte wise lookup table. The lookup table is on average 3 times faster than the K&R algorithm. If someone can figure a way to make the "Hacker's Delight" algorithm portable feel free to add it in.

 #ifndef _BITCOUNT_H_
 #define _BITCOUNT_H_

 /* Return the Hamming Weight of val, ie the number of 'on' bits. */
 int bitcount( unsigned int );

 /* List of available bitcount algorithms.
  * onTheFly:    Calculate the bitcount on demand.
  *
  * lookupTable: Uses a small lookup table to determine the bitcount.  This
  * method is on average 3 times as fast as onTheFly, but incurs a small
  * upfront cost to initialize the lookup table on the first call.
  *
  * strategyCount is just a placeholder.
  */
 enum strategy { onTheFly, lookupTable, strategyCount };

 /* String representations of the algorithm names */
 extern const char *strategyNames[];

 /* Choose which bitcount algorithm to use. */
 void setStrategy( enum strategy );

 #endif

 #include <limits.h>

 #include "bitcount.h"

 /* The number of entries needed in the table is equal to the number of unique
  * values a char can represent which is always UCHAR_MAX + 1 */
 static unsigned char _bitCountTable[UCHAR_MAX + 1];
 static unsigned int _lookupTableInitialized = 0;

 static int _defaultBitCount( unsigned int val ) {
     int count;

     /* Starting with:
      * 1100 - 1 == 1011,  1100 & 1011 == 1000
      * 1000 - 1 == 0111,  1000 & 0111 == 0000
      */
     for ( count = 0; val; ++count )
         val &= val - 1;

     return count;
 }

 /* Looks up each byte of the integer in a lookup table.
  *
  * The first time the function is called it initializes the lookup table.
  */
 static int _tableBitCount( unsigned int val ) {
     int bCount = 0;

     if ( !_lookupTableInitialized ) {
         unsigned int i;
         for ( i = 0; i != UCHAR_MAX + 1; ++i )
             _bitCountTable[i] = ( unsigned char )_defaultBitCount( i );

         _lookupTableInitialized = 1;
     }

     for ( ; val; val >>= CHAR_BIT )
         bCount += _bitCountTable[val & UCHAR_MAX];

     return bCount;
 }

 static int ( *_bitcount ) ( unsigned int ) = _defaultBitCount;

 const char *strategyNames[] = { "onTheFly", "lookupTable" };

 void setStrategy( enum strategy s ) {
     switch ( s ) {
     case onTheFly:
         _bitcount = _defaultBitCount;
         break;
     case lookupTable:
         _bitcount = _tableBitCount;
         break;
     case strategyCount:
         break;
     }
 }

 /* Just a forwarding function which will call whichever version of the
  * algorithm has been selected by the client
  */
 int bitcount( unsigned int val ) {
     return _bitcount( val );
 }

 #ifdef _BITCOUNT_EXE_

 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>

 /* Use the same sequence of pseudo random numbers to benchmark each Hamming
  * Weight algorithm.
  */
 void benchmark( int reps ) {
     clock_t start, stop;
     int i, j;
     static const int iterations = 1000000;

     for ( j = 0; j != strategyCount; ++j ) {
         setStrategy( j );

         srand( 257 );
         start = clock( );

         for ( i = 0; i != reps * iterations; ++i )
             bitcount( rand( ) );

         stop = clock( );

         printf( "\n\t%d pseudo-random integers using %s: %f seconds\n\n",
                 reps * iterations, strategyNames[j],
                 ( double )( stop - start ) / CLOCKS_PER_SEC );
     }
 }

 int main( void ) {
     int option;

     while ( 1 ) {
         printf( "Menu Options\n"
                 "\t1.\tPrint the Hamming Weight of an Integer\n"
                 "\t2.\tBenchmark Hamming Weight implementations\n"
                 "\t3.\tExit ( or cntl-d )\n\n\t" );

         if ( scanf( "%d", &option ) == EOF )
             break;

         switch ( option ) {
         case 1:
             printf( "Please enter the integer: " );
             if ( scanf( "%d", &option ) != EOF )
                 printf( "The Hamming Weight of %d ( 0x%X ) is %d\n\n",
                         option, option, bitcount( option ) );
             break;
         case 2:
             printf( "Please select number of reps ( in millions ): " );
             if ( scanf( "%d", &option ) != EOF )
                 benchmark( option );
             break;
         case 3:
             goto EXIT;
             break;
         default:
             printf( "Invalid option\n" );
         }
     }

  EXIT:
     printf( "\n" );
     return 0;
 }

 #endif

Fast C# solution using a pre-calculated table of byte bit counts, with branching on input size.

 public static class BitCount
 {
     public static uint GetSetBitsCount(uint n)
     {
         var counts = BYTE_BIT_COUNTS;
         return n <= 0xff ? counts[n]
              : n <= 0xffff ? counts[n & 0xff] + counts[n >> 8]
              : n <= 0xffffff ? counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff]
              : counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff] + counts[(n >> 24) & 0xff];
     }

     public static readonly uint[] BYTE_BIT_COUNTS =
     {
         0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
         1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
         1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
         1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
         3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
         4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
     };
 }

I always use this in Competitive Programming and it's easy to write and efficient:

 #include <bits/stdc++.h>
 using namespace std;

 int countOnes(int n) {
     bitset<32> b(n);
     return b.count();
 }

32-bit or not? I just came up with this method in Java after reading "Cracking the Coding Interview", 4th edition, exercise 5.5 (chapter 5: Bit Manipulation). If the least significant bit is 1, increment count, then right-shift the integer.

 public static int bitCount( int n){
     int count = 0;
     for (int i = n; i != 0; i = i >>> 1){   // use >>> so the loop also terminates for negative n
         count += i & 1;
     }
     return count;
 }

I think this one is more intuitive than the solutions with the constant 0x33333333, no matter how fast they are. It depends on your definition of "best algorithm".

Personally I use this:

 public static int myBitCount(long L){
     int count = 0;
     while (L != 0) {
         count++;
         L ^= L & -L;
     }
     return count;
 }