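What is the fastest/most efficient way to find the highest set bit (MSB) of an integer in C?

If I have an integer n and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want the position of the leftmost bit that is a 1), what is the quickest/most efficient way of finding out?

I know that POSIX supports an ffs() function in strings.h to find the first set bit, but there doesn't seem to be a corresponding fls() function.

Is there some really obvious way of doing this that I'm missing?

What about cases where you can't use POSIX functions, for portability?

Edit: What about a solution that works on both 32-bit and 64-bit architectures (many of the code listings look like they would only work on 32-bit ints)?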

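GCC has:

  - Built-in Function: int __builtin_clz (unsigned int x)
     Returns the number of leading 0-bits in X, starting at the most
     significant bit position. If X is 0, the result is undefined.

  - Built-in Function: int __builtin_clzl (unsigned long)
     Similar to `__builtin_clz', except the argument type is `unsigned
     long'.

  - Built-in Function: int __builtin_clzll (unsigned long long)
     Similar to `__builtin_clz', except the argument type is `unsigned
     long long'.

I'd expect them to be translated into something reasonably efficient for your current platform, whether that is one of those fancy bit-twiddling algorithms or a single instruction.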
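The builtins count leading zeros, so the bit index still has to be derived from the width of the type. A minimal sketch of that step, assuming GCC or Clang; the function names `msb_index_uint` and `msb_index_ull` are made up for illustration, and zero is handled separately because the builtins are undefined for it:

```c
#include <limits.h>

/* Index (0 = least significant bit) of the highest set bit of x,
 * or -1 when x is 0, since __builtin_clz(0) is undefined. */
static int msb_index_uint(unsigned int x)
{
    if (x == 0)
        return -1;
    return (int)(sizeof x * CHAR_BIT) - 1 - __builtin_clz(x);
}

/* Same idea for unsigned long long, using the "ll" variant of the builtin. */
static int msb_index_ull(unsigned long long x)
{
    if (x == 0)
        return -1;
    return (int)(sizeof x * CHAR_BIT) - 1 - __builtin_clzll(x);
}
```

Deriving the width from `sizeof` rather than hard-coding 31 or 63 keeps the same source usable on both 32-bit and 64-bit targets.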

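Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:

Searches the source operand for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand. The source operand can be a register or a memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source operand. If the content of the source operand is 0, the content of the destination operand is undefined.

(If you're on PowerPC, there's a similar cntlz ("count leading zeros") instruction.)

Example code for gcc:

    #include <iostream>

    int main (int,char**)
    {
      int n=1;
      for (;;++n) {
        int msb;
        asm("bsrl %1,%0" : "=r"(msb) : "r"(n));
        std::cout << n << " : " << msb << std::endl;
      }
      return 0;
    }

See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.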

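Since 2^N is the integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the same as computing the integer log base 2 of that integer.

http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious

    unsigned int v;   /* value to find the log base 2 of */
    unsigned r = 0;   /* r will hold the result */

    while (v >>= 1) {
        r++;
    }

This "obvious" algorithm may not be transparent to everyone, but it makes perfect sense once you realize that the code shifts right by one bit repeatedly until the leftmost set bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts. It also means that it works even when more than one bit is set: the result is always the position of the most significant bit.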
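Wrapped in a function, the shift-and-count loop looks like the sketch below. The name `ilog2_naive` is hypothetical, and the behaviour for zero (returning 0) is simply what the loop does, not something the bithacks page specifies:

```c
/* Floor of log base 2 of v for v > 0. For v == 0 the loop body never
 * runs and 0 is returned, so callers that care must test for zero. */
static unsigned ilog2_naive(unsigned int v)
{
    unsigned r = 0;

    while (v >>= 1)   /* stops once every set bit has been shifted out */
        r++;

    return r;
}
```

The number of iterations equals the index of the highest set bit, which is why the approach is quick for small values and slow for large ones.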

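If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers that have a lot of leading zeros, the naive approach may give acceptable speed, since bit shifting is rather fast in C and the simple algorithm doesn't need to index an array.

NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.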
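To illustrate the 64-bit caveat: a halving search written for 32 bits needs one extra step (the shift by 32) before it is correct for 64-bit input. The following is a sketch with a hypothetical name, `ilog2_u64`, not code taken from the bithacks page:

```c
#include <stdint.h>

/* Index of the highest set bit of a 64-bit value, or -1 for zero.
 * Each test halves the range that can still contain the top bit. */
static int ilog2_u64(uint64_t v)
{
    int r = 0;

    if (v == 0)
        return -1;
    if (v >> 32) { v >>= 32; r += 32; }  /* the step a 32-bit version lacks */
    if (v >> 16) { v >>= 16; r += 16; }
    if (v >> 8)  { v >>= 8;  r += 8;  }
    if (v >> 4)  { v >>= 4;  r += 4;  }
    if (v >> 2)  { v >>= 2;  r += 2;  }
    if (v >> 1)  { r += 1; }

    return r;
}
```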

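This should be lightning fast:

    int msb(unsigned int v)
    {
        static const int pos[32] = {
             0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
            31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
        };

        /* smear the highest set bit into every lower position */
        v |= v >> 1;
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;

        /* keep only the highest set bit */
        v = (v >> 1) + 1;

        /* The constant is unsigned int (U, not UL) so the product wraps at
           32 bits even where unsigned long is 64 bits wide; otherwise the
           index could exceed 31. */
        return pos[(v * 0x077CB531U) >> 27];
    }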

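This is like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal, of course, is speed.

My realization is that the CPU already has an automatic bit detector, used for integer-to-float conversion! So use that.

    /* assumes x86 endianness, IEEE 754 doubles, and that v fits in 32 bits;
       uint32_t (from <stdint.h>) is used so that the second 32-bit word of the
       double is read even where unsigned long is 64 bits wide */
    double ff=(double)(v|1);
    return ((*(1+(uint32_t *)&ff))>>20)-1023;

This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract extract the proper parts from the IEEE value.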
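The same exponent trick can be written without pointer casts by copying the bytes of the double into an integer. This is a sketch under the assumption that `double` is an IEEE 754 binary64 value with the same byte order as `uint64_t`; the name `msb_via_double` is made up here:

```c
#include <stdint.h>
#include <string.h>

/* Index of the highest set bit of v, or -1 for zero, read from the
 * exponent field of v converted to double.
 * Assumption: double is IEEE 754 binary64. */
static int msb_via_double(uint32_t v)
{
    double d = (double)v;   /* exact: 32 bits fit in the 53-bit significand */
    uint64_t bits;

    if (v == 0)
        return -1;

    memcpy(&bits, &d, sizeof bits);               /* well-defined type pun */
    return (int)((bits >> 52) & 0x7FF) - 1023;    /* remove the exponent bias */
}
```

Using `memcpy` avoids both the endianness-dependent word index and the aliasing problem of reading a `double` through an integer pointer.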

It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.

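Kaz Kylheku here

I benchmarked two approaches for this over 63-bit numbers (the long long type on gcc x86_64), staying away from the sign bit.

(I happen to need this "find highest bit" for something, you see.)

I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.

The decision tree (highest_bit_unrolled) benchmarked as 69% faster, except for the n = 0 case, for which the binary search has an explicit test.

The binary search's special test for the 0 case is only 48% faster than the decision tree, which does not have a special test.

Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 MHz Intel Core i5).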

    int highest_bit_unrolled(long long n)
    {
      if (n & 0x7FFFFFFF00000000) {
        if (n & 0x7FFF000000000000) {
          if (n & 0x7F00000000000000) {
            if (n & 0x7000000000000000) {
              if (n & 0x4000000000000000)
                return 63;
              else
                return (n & 0x2000000000000000) ? 62 : 61;
            } else {
              if (n & 0x0C00000000000000)
                return (n & 0x0800000000000000) ? 60 : 59;
              else
                return (n & 0x0200000000000000) ? 58 : 57;
            }
          } else {
            if (n & 0x00F0000000000000) {
              if (n & 0x00C0000000000000)
                return (n & 0x0080000000000000) ? 56 : 55;
              else
                return (n & 0x0020000000000000) ? 54 : 53;
            } else {
              if (n & 0x000C000000000000)
                return (n & 0x0008000000000000) ? 52 : 51;
              else
                return (n & 0x0002000000000000) ? 50 : 49;
            }
          }
        } else {
          if (n & 0x0000FF0000000000) {
            if (n & 0x0000F00000000000) {
              if (n & 0x0000C00000000000)
                return (n & 0x0000800000000000) ? 48 : 47;
              else
                return (n & 0x0000200000000000) ? 46 : 45;
            } else {
              if (n & 0x00000C0000000000)
                return (n & 0x0000080000000000) ? 44 : 43;
              else
                return (n & 0x0000020000000000) ? 42 : 41;
            }
          } else {
            if (n & 0x000000F000000000) {
              if (n & 0x000000C000000000)
                return (n & 0x0000008000000000) ? 40 : 39;
              else
                return (n & 0x0000002000000000) ? 38 : 37;
            } else {
              if (n & 0x0000000C00000000)
                return (n & 0x0000000800000000) ? 36 : 35;
              else
                return (n & 0x0000000200000000) ? 34 : 33;
            }
          }
        }
      } else {
        if (n & 0x00000000FFFF0000) {
          if (n & 0x00000000FF000000) {
            if (n & 0x00000000F0000000) {
              if (n & 0x00000000C0000000)
                return (n & 0x0000000080000000) ? 32 : 31;
              else
                return (n & 0x0000000020000000) ? 30 : 29;
            } else {
              if (n & 0x000000000C000000)
                return (n & 0x0000000008000000) ? 28 : 27;
              else
                return (n & 0x0000000002000000) ? 26 : 25;
            }
          } else {
            if (n & 0x0000000000F00000) {
              if (n & 0x0000000000C00000)
                return (n & 0x0000000000800000) ? 24 : 23;
              else
                return (n & 0x0000000000200000) ? 22 : 21;
            } else {
              if (n & 0x00000000000C0000)
                return (n & 0x0000000000080000) ? 20 : 19;
              else
                return (n & 0x0000000000020000) ? 18 : 17;
            }
          }
        } else {
          if (n & 0x000000000000FF00) {
            if (n & 0x000000000000F000) {
              if (n & 0x000000000000C000)
                return (n & 0x0000000000008000) ? 16 : 15;
              else
                return (n & 0x0000000000002000) ? 14 : 13;
            } else {
              if (n & 0x0000000000000C00)
                return (n & 0x0000000000000800) ? 12 : 11;
              else
                return (n & 0x0000000000000200) ? 10 : 9;
            }
          } else {
            if (n & 0x00000000000000F0) {
              if (n & 0x00000000000000C0)
                return (n & 0x0000000000000080) ? 8 : 7;
              else
                return (n & 0x0000000000000020) ? 6 : 5;
            } else {
              if (n & 0x000000000000000C)
                return (n & 0x0000000000000008) ? 4 : 3;
              else
                return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
            }
          }
        }
      }
    }

    int highest_bit(long long n)
    {
      const long long mask[] = {
        0x000000007FFFFFFF,
        0x000000000000FFFF,
        0x00000000000000FF,
        0x000000000000000F,
        0x0000000000000003,
        0x0000000000000001
      };
      int hi = 64;
      int lo = 0;
      int i = 0;

      if (n == 0)
        return 0;

      for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
        int mi = lo + (hi - lo) / 2;

        if ((n >> mi) != 0)
          lo = mi;
        else if ((n & (mask[i] << lo)) != 0)
          hi = mi;
      }

      return lo + 1;
    }

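Quick and dirty test program:

    #include <stdio.h>
    #include <time.h>
    #include <stdlib.h>

    int highest_bit_unrolled(long long n);
    int highest_bit(long long n);

    int main(int argc, char **argv)
    {
      long long n = strtoull(argv[1], NULL, 0);
      int b1, b2;
      long i;
      clock_t start = clock(), mid, end;

      for (i = 0; i < 1000000000; i++)
        b1 = highest_bit_unrolled(n);

      mid = clock();

      for (i = 0; i < 1000000000; i++)
        b2 = highest_bit(n);

      end = clock();

      printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);

      printf("time1 = %d\n", (int) (mid - start));
      printf("time2 = %d\n", (int) (end - mid));
      return 0;
    }

With only -O2, the difference becomes greater. The decision tree is almost four times faster.

I also benchmarked against the naive bit-shifting code:

    int highest_bit_shift(long long n)
    {
      int i = 0;
      for (; n; n >>= 1, i++)
        ; /* empty */
      return i;
    }

This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of the numbers chosen at random in the 63-bit space have the 63rd bit set!

On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and comes out 1120% faster (12.2 times) than the bit shifter.

I will also benchmark the decision tree against the GCC builtins, and try a mixture of inputs rather than repeating the same number. There may be some sticky branch prediction going on, and perhaps some unrealistic caching scenarios, which make it artificially faster on repetitions.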

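    unsigned int msb32(register unsigned int x)
    {
        x |= (x >> 1);
        x |= (x >> 2);
        x |= (x >> 4);
        x |= (x >> 8);
        x |= (x >> 16);
        return(x & ~(x >> 1));
    }

1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which operates in linear time. This is logarithmic time. Note that it returns the value with only the most significant bit kept, not the index of that bit.

From http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit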
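If the index is wanted rather than the isolated bit, one option is to count the ones after the smearing step: once the highest set bit has been copied into every lower position, the number of set bits is the index plus one. The sketch below uses a standard SWAR population count; the names `ones32` and `floor_log2_smear` are chosen here for illustration and assume 32-bit input:

```c
#include <stdint.h>

/* Population count of a 32-bit word using parallel (SWAR) addition. */
static unsigned ones32(uint32_t x)
{
    x -= (x >> 1) & 0x55555555u;                       /* 2-bit sums */
    x = ((x >> 2) & 0x33333333u) + (x & 0x33333333u);  /* 4-bit sums */
    x = ((x >> 4) + x) & 0x0f0f0f0fu;                  /* 8-bit sums */
    x += x >> 8;
    x += x >> 16;
    return x & 0x3fu;
}

/* Index of the highest set bit, or -1 for zero. */
static int floor_log2_smear(uint32_t x)
{
    x |= x >> 1;    /* smear the top bit into all lower positions */
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return (int)ones32(x) - 1;
}
```

Zero needs no special case here: nothing is smeared, the count is 0, and the result is -1.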

What about

    #include <cmath>

    int highest_bit(unsigned int a)
    {
        int count;
        std::frexp(a, &count);   // a = m * 2^count with 0.5 <= m < 1
        return count - 1;
    }

?

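Here are some (simple) benchmarks of the algorithms currently given on this page...

The algorithms have not been tested over all inputs of unsigned int, so check that first before blindly using something ;)

On my machine, clz (__builtin_clz) and asm work best. asm seems even faster than clz... but that might be due to the simple benchmark...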

    //////// go.c ///////////////////////////////
    // compile with: gcc go.c -o go -lm
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /***************** math ********************/

    #define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */ \
      ((unsigned) log2(a))           /* thus: do not use if a <= 0 */

    #define NUM_OF_HIGHESTBITmath(a) ((a) \
                      ? (1U << POS_OF_HIGHESTBITmath(a)) \
                      : 0)

    /***************** clz ********************/

    unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
    #define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */

    #define NUM_OF_HIGHESTBITclz(a) ((a) \
                      ? (1U << POS_OF_HIGHESTBITclz(a)) \
                      : 0)

    /***************** i2f ********************/

    double FF;
    /* uses the macro argument (the original referred to the variable ui directly) */
    #define POS_OF_HIGHESTBITi2f(a) (FF = (double)((a)|1), ((*(1+(unsigned*)&FF))>>20)-1023)

    #define NUM_OF_HIGHESTBITi2f(a) ((a) \
                      ? (1U << POS_OF_HIGHESTBITi2f(a)) \
                      : 0)

    /***************** asm ********************/

    unsigned OUT;
    #define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)

    #define NUM_OF_HIGHESTBITasm(a) ((a) \
                      ? (1U << POS_OF_HIGHESTBITasm(a)) \
                      : 0)

    /***************** bitshift1 ********************/

    #define NUM_OF_HIGHESTBITbitshift1(a) (({ \
      OUT = a; \
      OUT |= (OUT >> 1); \
      OUT |= (OUT >> 2); \
      OUT |= (OUT >> 4); \
      OUT |= (OUT >> 8); \
      OUT |= (OUT >> 16); \
      }), (OUT & ~(OUT >> 1)))

    /***************** bitshift2 ********************/

    int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,
                 30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
                 16, 7, 26, 12, 18, 6, 11, 5, 10, 9};

    /* 0x077CB531U (not UL) keeps the product at 32 bits on LP64 targets */
    #define POS_OF_HIGHESTBITbitshift2(a) (({ \
      OUT = a; \
      OUT |= OUT >> 1; \
      OUT |= OUT >> 2; \
      OUT |= OUT >> 4; \
      OUT |= OUT >> 8; \
      OUT |= OUT >> 16; \
      OUT = (OUT >> 1) + 1; \
      }), POS[(OUT * 0x077CB531U) >> 27])

    #define NUM_OF_HIGHESTBITbitshift2(a) ((a) \
                      ? (1U << POS_OF_HIGHESTBITbitshift2(a)) \
                      : 0)

    #define LOOPS 100000000U

    int main()
    {
      clock_t start, end;   /* clock() returns clock_t, not time_t */
      unsigned ui;
      unsigned n;

      /********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/
      printf("math\n");
      for (ui = 0U; ui < 18; ++ui)
        printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));

      printf("\n\n");

      printf("clz\n");
      for (ui = 0U; ui < 18U; ++ui)
        printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));

      printf("\n\n");

      printf("i2f\n");
      for (ui = 0U; ui < 18U; ++ui)
        printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));

      printf("\n\n");

      printf("asm\n");
      for (ui = 0U; ui < 18U; ++ui) {
        printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));
      }

      printf("\n\n");

      printf("bitshift1\n");
      for (ui = 0U; ui < 18U; ++ui) {
        printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));
      }

      printf("\n\n");

      printf("bitshift2\n");
      for (ui = 0U; ui < 18U; ++ui) {
        printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));
      }

      printf("\n\nPlease wait...\n\n");

      /************************* Simple clock() benchmark ******************/
      start = clock();
      for (ui = 0; ui < LOOPS; ++ui)
        n = NUM_OF_HIGHESTBITmath(ui);
      end = clock();
      printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

      start = clock();
      for (ui = 0; ui < LOOPS; ++ui)
        n = NUM_OF_HIGHESTBITclz(ui);
      end = clock();
      printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

      start = clock();
      for (ui = 0; ui < LOOPS; ++ui)
        n = NUM_OF_HIGHESTBITi2f(ui);
      end = clock();
      printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

      start = clock();
      for (ui = 0; ui < LOOPS; ++ui)
        n = NUM_OF_HIGHESTBITasm(ui);
      end = clock();
      printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

      start = clock();
      for (ui = 0; ui < LOOPS; ++ui)
        n = NUM_OF_HIGHESTBITbitshift1(ui);
      end = clock();
      printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

      start = clock();
      for (ui = 0; ui < LOOPS; ++ui)
        n = NUM_OF_HIGHESTBITbitshift2(ui);
      end = clock();
      printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

      printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");

      return EXIT_SUCCESS;
    }

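Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.

The bsrl assembly instruction computes the position of the most significant bit. Thus, we could use this asm statement:

    asm ("bsrl %1, %0" : "=r" (position) : "r" (number));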
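As a rough sketch, the asm statement could be wrapped like this. The function name `msb_bsr` and the fallback loop are assumptions added for illustration; the zero check is there because BSR leaves its destination undefined for a zero source:

```c
/* Index of the highest set bit of number, or -1 for zero.
 * Uses BSR on x86 with GCC-style inline asm, a plain loop elsewhere. */
static int msb_bsr(unsigned int number)
{
    unsigned int position;

    if (number == 0)
        return -1;   /* BSR's result is undefined for a zero source */

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("bsrl %1, %0" : "=r"(position) : "r"(number) : "cc");
#else
    position = 0;
    while (number >>= 1)
        position++;
#endif

    return (int)position;
}
```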

I needed a routine to do this, and before searching the web (and finding this page) I came up with my own solution based on a binary search. I'm sure someone has done this before, though! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.

    int highest_bit(unsigned int a)
    {
      static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
      const unsigned int *mask = maskv;
      int l, h;

      if (a == 0) return -1;

      l = 0;
      h = 32;

      do {
        int m = l + (h - l) / 2;

        if ((a >> m) != 0) l = m;
        else if ((a & (*mask << l)) != 0) h = m;

        mask++;
      } while (l < h - 1);

      return l;
    }

This is a kind of binary search; it works with all kinds of (unsigned!) integer types:

    #include <climits>

    // No parentheses around the type: the macro is used in declarations.
    #define UINT unsigned int
    #define UINT_BIT (CHAR_BIT*sizeof(UINT))

    int msb(UINT x)
    {
        if(0 == x)
            return -1;

        int c = 0;

        for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
        if(static_cast<UINT>(x >> i))
        {
            x >>= i;
            c |= i;
        }

        return c;
    }

To make it complete:

    #include <climits>

    #define UINT unsigned int
    #define UINT_BIT (CHAR_BIT*sizeof(UINT))

    int lsb(UINT x)
    {
        if(0 == x)
            return -1;

        int c = UINT_BIT-1;

        for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
        if(static_cast<UINT>(x << i))
        {
            x <<= i;
            c ^= i;
        }

        return c;
    }

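Think bitwise operators.

I misunderstood the question the first time. You should produce an int with only the leftmost bit set (the others zero). Assuming cmp is set to that value:

    position = sizeof(int)*8;
    /* n must be non-zero, otherwise this loop never ends */
    while(!(n & cmp)){
       n <<=1;
       position--;
    }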

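Expanding on Josh's benchmark... the clz case can be improved as follows:

    /***************** clz2 ********************/

    #define NUM_OF_HIGHESTBITclz2(a) ((a) \
                      ? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \
                      : 0)

Regarding the asm: note that there are bsr and bsrl (the latter is the "long" version). The normal one might be a bit faster.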

As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be specific to either 32-bit or 64-bit registers. The stanford.edu bithacks page provides solutions that work for both 32-bit and 64-bit computing. With a little work, they can be combined into a solid cross-architecture approach to obtaining the MSB. The solution I arrived at, which compiled and worked on both 64-bit and 32-bit computers, was:

    #if defined(__LP64__) || defined(_LP64)
    # define BUILD_64   1
    #endif

    #include <stdio.h>
    #include <stdint.h>  /* for uint32_t */

    /* CHAR_BIT  (or include limits.h) */
    #ifndef CHAR_BIT
    #define CHAR_BIT  8
    #endif  /* CHAR_BIT */

    /*
     * Find the log base 2 of an integer with the MSB N set in O(N)
     * operations. (on 64bit & 32bit architectures)
     */
    int
    getmsb (uint32_t word)
    {
        int r = 0;
        if (word < 1)
            return 0;
    #ifdef BUILD_64
        union { uint32_t u[2]; double d; } t;  // temp
        t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
        t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;
        t.d -= 4503599627370496.0;
        r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
    #else
        while (word >>= 1)
        {
            r++;
        }
    #endif  /* BUILD_64 */
        return r;
    }

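Some overly complex answers here. The de Bruijn technique should only be used when the input is already a power of two; otherwise there's a better way. For a power-of-2 input, de Bruijn is the absolute fastest, even faster than _BitScanReverse, on any processor I've tested. However, in the general case, _BitScanReverse (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPUs it can be microcoded, though).

If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.

    #include <stdint.h>

    /* u8 and u32 are not defined in the original snippet; these typedefs
       are the assumed meaning */
    typedef uint8_t  u8;
    typedef uint32_t u32;

    u8 inline log2 (u32 val)
    {
        u8 k = 0;
        if (val > 0x0000FFFFu) { val >>= 16; k  = 16; }
        if (val > 0x000000FFu) { val >>= 8;  k |= 8;  }
        if (val > 0x0000000Fu) { val >>= 4;  k |= 4;  }
        if (val > 0x00000003u) { val >>= 2;  k |= 2;  }
        k |= (val & 2) >> 1;
        return k;
    }

Note that this version does not require a de Bruijn lookup at the end, unlike most of the other answers. It computes the position in place.

Tables can be preferable, though: if you call the function often enough, the risk of a cache miss is eclipsed by the speedup from the table.

    u8 kTableLog2[256] = {
        0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
        5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
        6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
        6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
    };

    u8 log2_table(u32 val)
    {
        u8 k = 0;
        if (val > 0x0000FFFFuL) { val >>= 16; k  = 16; }
        if (val > 0x000000FFuL) { val >>= 8;  k |= 8;  }
        k |= kTableLog2[val]; // precompute the Log2 of the low byte
        return k;
    }

This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.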

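A version in C using successive approximation:

    unsigned int getMsb(unsigned int n)
    {
      unsigned int msb  = sizeof(n) * 4;   /* start at half the bit width */
      unsigned int step = msb;
      while (step > 1)
      {
        step /= 2;
        if (n >> msb)
          msb += step;
        else
          msb -= step;
      }
      if (n >> msb)
        msb++;
      return (msb - 1);
    }

Advantage: the running time is constant regardless of the number provided, because the number of loop iterations is always the same (4 iterations when using "unsigned int").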

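The Visual Studio-only counterpart to ephemient's gcc-only answer is:

    auto n = 13;
    unsigned long Index;
    _BitScanReverse(&Index, n);
    cout << "MSB is: " << Index << endl; // Prints 3 (zero offset)

One caveat about _BitScanReverse: the index is 0 for an input of 1, and for an input of 0 the function's return value is 0 and the index is not meaningful, so check the return value to tell the two cases apart.
For more information: http://msdn.microsoft.com/en-us/library/fbxyd7zd.aspx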
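The two compiler-specific answers can be combined behind one function. The following is a sketch only: the name `highest_set_bit` is invented here, the MSVC branch assumes `<intrin.h>` provides `_BitScanReverse`, and the last branch is a plain loop for any other compiler:

```c
#include <limits.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* Index of the highest set bit of x, or -1 for zero. */
static int highest_set_bit(unsigned int x)
{
    if (x == 0)
        return -1;   /* neither intrinsic gives a usable result for 0 */

#if defined(_MSC_VER)
    {
        unsigned long index;
        _BitScanReverse(&index, x);
        return (int)index;
    }
#elif defined(__GNUC__)
    return (int)(sizeof x * CHAR_BIT) - 1 - __builtin_clz(x);
#else
    {
        int r = 0;
        while (x >>= 1)
            r++;
        return r;
    }
#endif
}
```

Handling zero before the intrinsic is called gives all three branches the same contract.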

Note that what you are trying to do is calculate the integer log2 of an integer,

    #include <stdio.h>
    #include <stdlib.h>

    unsigned int Log2(unsigned long x)
    {
        unsigned long n = x;
        int bits = sizeof(x)*8;
        int step = 1; int k=0;
        for( step = 1; step < bits; ) {
            n |= (n >> step);
            step *= 2; ++k;
        }
        //printf("%ld %ld\n",x, (x - (n >> 1)) );
        return(x - (n >> 1));
    }

Note that you can attempt to search more than one bit at a time.

    unsigned int Log2_a(unsigned long x)
    {
        unsigned long n = x;
        int bits = sizeof(x)*8;
        int step = 1;
        int step2 = 0;
        //observe that you can move 8 bits at a time, and there is a pattern...
        //if( x>1<<step2+8 ) { step2+=8;
            //if( x>1<<step2+8 ) { step2+=8;
                //if( x>1<<step2+8 ) { step2+=8;
                //}
            //}
        //}
        for( step2=0; x>1L<<step2+8; ) {
            step2+=8;
        }
        //printf("step2 %d\n",step2);
        for( step = 0; x>1L<<(step+step2); ) {
            step+=1;
            //printf("step %d\n",step+step2);
        }
        printf("log2(%ld) %d\n",x,step+step2);
        return(step+step2);
    }

This approach uses a binary search:

    unsigned int Log2_b(unsigned long x)
    {
        unsigned long n = x;
        unsigned int bits = sizeof(x)*8;
        unsigned int hbit = bits-1;
        unsigned int lbit = 0;
        unsigned long guess = bits/2;
        int found = 0;

        while ( hbit-lbit>1 ) {
            //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
            //when value between guess..lbit
            if( (x<=(1L<<guess)) ) {
                //printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
                hbit=guess;
                guess=(hbit+lbit)/2;
                //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
            }
            //when value between hbit..guess
            //else
            if( (x>(1L<<guess)) ) {
                //printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
                lbit=guess;
                guess=(hbit+lbit)/2;
                //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
            }
        }
        if( (x>(1L<<guess)) ) ++guess;
        printf("log2(x%ld)=r%d\n",x,guess);
        return(guess);
    }

Another binary search method, perhaps more readable,

    unsigned int Log2_c(unsigned long x)
    {
        unsigned long v = x;
        unsigned int bits = sizeof(x)*8;
        unsigned int step = bits;
        unsigned int res = 0;
        for( step = bits/2; step>0; ) {
            //printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
            while ( v>>step ) {
                v>>=step;
                res+=step;
                //printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
            }
            step /= 2;
        }
        if( (x>(1L<<res)) ) ++res;
        printf("log2(x%ld)=r%u\n",x,res);   /* res is unsigned int, so %u */
        return(res);
    }

And because you will want to test these,

    int main()
    {
        unsigned long int x = 3;
        for( x=2; x<1000000000; x*=2 ) {
            //printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));
            printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));
            printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));
            printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));
        }
        return(0);
    }

Putting this in since it's 'yet another' approach, seems to be different from others already given.

returns -1 if x==0 , otherwise floor( log2(x)) (max result 31)

Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.

This is what I use when I don't want to use __builtin_clz because of portability issues.

To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.

    int log2floor( unsigned x ){
       static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
       int r = 0;
       unsigned xk = x >> 16;
       if( xk != 0 ){
           r = 16;
           x = xk;
       }
       // x is 0 .. 0xFFFF
       xk = x >> 8;
       if( xk != 0){
           r += 8;
           x = xk;
       }
       // x is 0 .. 0xFF
       xk = x >> 4;
       if( xk != 0){
           r += 4;
           x = xk;
       }
       // now x is 0..15; x=0 only if originally zero.
       return r + wtab[x];
    }
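The more compact loop variant mentioned above (reduce by 4 bits per iteration, then finish with the 16-entry table) might look like the sketch below. The name `log2floor_loop` is hypothetical; for a 32-bit `unsigned` the loop runs at most 7 times:

```c
/* Returns -1 if x == 0, otherwise floor(log2(x)). */
static int log2floor_loop(unsigned x)
{
    static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
    int r = 0;

    /* Strip 4 bits at a time until only the top nibble is left. */
    while (x >> 4) {
        x >>= 4;
        r += 4;
    }

    /* x is now 0..15; it is 0 only if the input was 0, so r is still 0. */
    return r + wtab[x];
}
```

This trades the fixed three tests of the unrolled version for a data-dependent loop, which is smaller in code size but slower for large inputs.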

I know this question is very old, but having just implemented an msb() function myself, I found that most solutions presented here and on other websites are not necessarily the most efficient, at least by my personal definition of efficiency (see also the Update below). Here's why:

Most solutions (especially those which employ some sort of binary search scheme, or the naïve approach which does a linear scan from right to left) seem to neglect the fact that, for arbitrary binary numbers, not many start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01. See what I'm getting at? My argument is that a linear scan from the most significant bit position to the least significant (left to right) is not as "linear" as it might look at first glance.

It can be shown ¹ , that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).

Of course, the worst case is still O(n) , worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications ( Update : not quite: There may be few, but they might occur with high probability – see Update below).

Here is the "naïve" approach I've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log₂(32) = 5 steps, whereas this silly algorithm requires fewer than 2 on average). Sorry for this being C++ and not pure C:

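    #include <limits>
    #include <type_traits>

    template <typename T>
    auto msb(T n) -> int
    {
        static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
            "msb<T>(): T must be an unsigned integral type.");

        // The loop index is a signed int so that i >= 0 can become false
        // (with an unsigned index the loop would never end for n == 0), and
        // the mask is built in type T so the shift is valid for wide types.
        int i = std::numeric_limits<T>::digits - 1;
        for (T mask = T(1) << i; i >= 0; --i, mask >>= 1)
        {
            if ((n & mask) != 0)
                return i;
        }

        return 0;
    }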

Update: While what I wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers for which such a function will be called usually follow a different pattern. In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater than or equal to an object size. My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely use all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach, so the latter should probably be preferred, even though my solution will be faster when looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end, despite the fact that it's amortized O(1) for truly arbitrary integers.

¹ The argument goes like this (rough draft): Let n be the number of bits (bit-width). There are a total of 2ⁿ integers which can be represented with n bits. There are 2^{n – 1} integers starting with a 1 (the first 1 is fixed, the remaining n – 1 bits can be anything). Those integers require only one iteration of the loop to determine the MSB. Further, there are 2^{n – 2} integers starting with 01, requiring 2 iterations, 2^{n – 3} integers starting with 001, requiring 3 iterations, and so on.

If we sum up all the required iterations for all possible integers and divide them by 2 ⁿ , the total number of integers, we get the average number of iterations needed for determining the MSB for n -bit integers:

(1 * 2 ^{n – 1} + 2 * 2 ^{n – 2} + 3 * 2 ^{n – 3} + … + n) / 2 ⁿ

This series of average iterations is actually convergent and has a limit of 2 for n towards infinity

Thus, the naïve left-to-right algorithm has actually an amortized constant time complexity of O(1) for any number of bits.
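For reference, the average above can be written in closed form. This is a worked restatement of the footnote's sum using the standard identity for $\sum k x^k$ at $x = 1/2$; the symbol $\bar{k}(n)$ for the average iteration count is introduced here and does not appear in the original answer:

```latex
\bar{k}(n)
  = \frac{1}{2^{n}} \sum_{k=1}^{n} k \, 2^{\,n-k}
  = \sum_{k=1}^{n} \frac{k}{2^{k}}
  = 2 - \frac{n+2}{2^{n}}
  \;\xrightarrow[\;n \to \infty\;]{}\; 2
```

For example, $n = 3$ gives $\tfrac12 + \tfrac24 + \tfrac38 = \tfrac{11}{8} = 2 - \tfrac{5}{8}$, and for $n = 32$ the average is already within about $8 \times 10^{-9}$ of 2.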

Code:

    #include <cstdio>

    // x >= 1
    unsigned func(unsigned x)
    {
        double d = x;
        // read the biased exponent of the double and remove the bias
        int p = (*reinterpret_cast<long long*>(&d) >> 52) - 1023;
        printf("The left-most non zero bit of %u is bit %d\n", x, p);
        return p;
    }

Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1

Wow, that's a lot of answers. I am not sorry for answering an old question.

    int result = 0;//could be a char or int8_t instead
    if(value){//this assumes the value is 64bit
        if(0xFFFFFFFF00000000&value){ value>>=(1<<5); result|=(1<<5); }//if it is 32bit then remove this line
        if(0x00000000FFFF0000&value){ value>>=(1<<4); result|=(1<<4); }//and remove the 32msb
        if(0x000000000000FF00&value){ value>>=(1<<3); result|=(1<<3); }
        if(0x00000000000000F0&value){ value>>=(1<<2); result|=(1<<2); }
        if(0x000000000000000C&value){ value>>=(1<<1); result|=(1<<1); }
        if(0x0000000000000002&value){ result|=(1<<0); }
    }else{
        result=-1;
    }

This answer is pretty similar to another answer… oh well.

Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 64K of memory instead of just 256 lookup entries) here is a solution using a 16-bit lookup table , in C# 7 for .NET .

The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I allocate unmanaged memory for this by using Marshal.AllocHGlobal . As you can see, for maximum performance, the whole example is written as native:

    public static unsafe class MyStaticClass
    {
        readonly static sbyte* highest_one_16;

        static MyStaticClass()
        {
            // Initialize a table of 65536 bytes with the bit position (counting from LSB=0)
            // of the highest 'set' (non-zero) bit in each corresponding 16-bit index.
            var p = (sbyte*)Marshal.AllocHGlobal(0x10000);
            *p++ = -1;
            sbyte n = 0;
            for (int c = 1; n < 16; c <<= 1, n++)
                for (int i = 0; i < c; i++)
                    *p++ = n;
            highest_one_16 = p - 0x10000;
        }

        // continued below ...

The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log ₂ , which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).

Notice that the table entry for 0 , the sole integer for which the notion of 'highest set bit' is undefined, is given the value -1 . This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:

        // continued from above...

        public static int HighestOne(sbyte i8) => highest_one_16[i8];
        public static int HighestOne(byte ui8) => highest_one_16[ui8];
        public static int HighestOne(char ch) => highest_one_16[ch];
        public static int HighestOne(short i16) => highest_one_16[i16];
        public static int HighestOne(ushort ui16) => highest_one_16[ui16];

        public static int HighestOne(int i32) => HighestOne((uint)i32);
        public static int HighestOne(uint ui32)
        {
            int i;
            return (i = highest_one_16[ui32 >> 16]) < 0 ?
                        highest_one_16[(ushort)ui32] : (i + 16);
        }

        public static int HighestOne(long i64) => HighestOne((ulong)i64);
        public static int HighestOne(ulong ui64)
        {
            int i, j;
            return (i = highest_one_16[ui64 >> (j = 48)]) < 0 &&
                   (i = highest_one_16[ui64 >> (j = 32)]) < 0 &&
                   (i = highest_one_16[ui64 >> (j = 16)]) < 0 ?
                        highest_one_16[(ushort)ui64] : (i + j);
        }
    };

This is a complete working solution which has been tested via being battered in several heavy use apps for nearly a decade as shown (not counting cosmetic code updates for C# 7).

One approach could be to keep shifting left till the number becomes negative.

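Here is the code:

    #include <stdio.h>

    /* Prints which bit, counting from the left (1 = top bit), is the highest set bit. */
    void Funct(int number)
    {
        /* shift as unsigned: shifting a signed value into the sign bit is undefined */
        unsigned int u = (unsigned int)number;
        int count = 0;

        /* keep shifting left until the value, read as an int, becomes negative */
        while ((int)u > 0)
        {
            u = u << 1;
            count++;
        }
        printf("It is the no \"%d\" bit from the left\n", (count+1));
    }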
