寻找ARM Thumb2的有效整数平方根algorithm

我正在寻找一个快速的整数唯一algorithm来find无符号整数的平方根(整数部分)。 该代码必须在ARM Thumb 2处理器上具有出色的性能。 它可以是汇编语言或C代码。

任何提示欢迎。

Jack W. Crenshaw的Integer Square Roots可以作为另一个参考。

C Snippets Archive也有一个整数平方根的实现 。 这个超越了整数结果,并且计算了答案的额外小数(定点)位。 (更新:不幸的是,C代码片段存档已经失效,链接指向页面的Web存档。)以下是C片段存档代码:

#define BITSPERLONG 32 #define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2)) struct int_sqrt { unsigned sqrt, frac; }; /* usqrt: ENTRY x: unsigned long EXIT returns floor(sqrt(x) * pow(2, BITSPERLONG/2)) Since the square root never uses more than half the bits of the input, we use the other half of the bits to contain extra bits of precision after the binary point. EXAMPLE suppose BITSPERLONG = 32 then usqrt(144) = 786432 = 12 * 65536 usqrt(32) = 370727 = 5.66 * 65536 NOTES (1) change BITSPERLONG to BITSPERLONG/2 if you do not want the answer scaled. Indeed, if you want n bits of precision after the binary point, use BITSPERLONG/2+n. The code assumes that BITSPERLONG is even. (2) This is really better off being written in assembly. The line marked below is really a "arithmetic shift left" on the double-long value with r in the upper half and x in the lower half. This operation is typically expressible in only one or two assembly instructions. (3) Unrolling this loop is probably not a bad idea. ALGORITHM The calculations are the base-two analogue of the square root algorithm we all learned in grammar school. Since we're in base 2, there is only one nontrivial trial multiplier. Notice that absolutely no multiplications or divisions are performed. This means it'll be fast on a wide range of processors. */ void usqrt(unsigned long x, struct int_sqrt *q) { unsigned long a = 0L; /* accumulator */ unsigned long r = 0L; /* remainder */ unsigned long e = 0L; /* trial product */ int i; for (i = 0; i < BITSPERLONG; i++) /* NOTE 1 */ { r = (r << 2) + TOP2BITS(x); x <<= 2; /* NOTE 2 */ a <<= 1; e = (a << 1) + 1; if (r >= e) { r -= e; a++; } } memcpy(q, &a, sizeof(long)); } 

我解决了下面的代码。 它基本上来自维基百科有关平方根计算方法的文章 。 但是已经改为使用stdint.htypesuint32_t等。严格来说,返回types可以改为uint16_t

 /** * \brief Fast Square root algorithm * * Fractional parts of the answer are discarded. That is: * - SquareRoot(3) --> 1 * - SquareRoot(4) --> 2 * - SquareRoot(5) --> 2 * - SquareRoot(8) --> 2 * - SquareRoot(9) --> 3 * * \param[in] a_nInput - unsigned integer for which to find the square root * * \return Integer square root of the input value. */ uint32_t SquareRoot(uint32_t a_nInput) { uint32_t op = a_nInput; uint32_t res = 0; uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type // "one" starts at the highest power of four <= than the argument. while (one > op) { one >>= 2; } while (one != 0) { if (op >= res + one) { op = op - (res + one); res = res + 2 * one; } res >>= 1; one >>= 2; } return res; } 

我发现,好的一点是,一个相当简单的修改可以返回“四舍五入”的答案。 我发现这在某些应用程序中有用于更高的准确性。 请注意,在这种情况下,返回types必须是uint32_t因为2 32 – 1的圆angular平方根是2 16

 /** * \brief Fast Square root algorithm, with rounding * * This does arithmetic rounding of the result. That is, if the real answer * would have a fractional part of 0.5 or greater, the result is rounded up to * the next integer. * - SquareRootRounded(2) --> 1 * - SquareRootRounded(3) --> 2 * - SquareRootRounded(4) --> 2 * - SquareRootRounded(6) --> 2 * - SquareRootRounded(7) --> 3 * - SquareRootRounded(8) --> 3 * - SquareRootRounded(9) --> 3 * * \param[in] a_nInput - unsigned integer for which to find the square root * * \return Integer square root of the input value. */ uint32_t SquareRootRounded(uint32_t a_nInput) { uint32_t op = a_nInput; uint32_t res = 0; uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type // "one" starts at the highest power of four <= than the argument. while (one > op) { one >>= 2; } while (one != 0) { if (op >= res + one) { op = op - (res + one); res = res + 2 * one; } res >>= 1; one >>= 2; } /* Do arithmetic rounding to nearest integer */ if (op > res) { res++; } return res; } 

如果不需要准确的准确性,我们有一个快速的近似值,使用260bytes的RAM(你可以减半,但不要)。

 int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340}; int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977}; int fisqrt(int val) { int cnt=0; int t=val; while (t) {cnt++;t>>=1;} if (6>=cnt) t=(val<<(6-cnt)); else t=(val>>(cnt-6)); return (ftbl[cnt]*ftbl2[t&31])>>15; } 

以下是生成表的代码:

 ftbl[0]=0; for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i)); printf("int ftbl[33]={0"); for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]); printf("};\n"); for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768; printf("int ftbl2[32]={"); for (int i=0;i<32;i++) printf("%c%d",(i)?',':' ',ftbl2[i]); printf("};\n"); 

在1-> 2 ^ 20的范围内,最大错误是11,在1-> 2 ^ 30的范围内,它大约是256.您可以使用更大的表并将其最小化。 值得一提的是,这个错误总是会是负面的,即错误的时候,这个值会比正确的值小。

你可能会好好跟随这个炼油阶段。

这个想法很简单:(ab)^ 0.5 = a ^ 0.b * b ^ 0.5。

所以,我们取inputX = A * B,其中A = 2 ^ N且1 <= B <2然后我们有sqrt(2 ^ N)的查找表和sqrt(1 <= B <2) 。 我们将sqrt(2 ^ N)的查找表存储为整数,这可能是一个错误(testing显示没有不良影响),我们将sqrt(1 <= B <2)的查找存储在15位定点处。

我们知道1 <= sqrt(2 ^ N)<65536,所以这是16位,我们知道我们真的只能在ARM上乘16位×15位,而不用担心报复,所以这就是我们所做的。

在实现方面,while(t){cnt ++; t >> = 1;}实际上是一个计数前导位指令(CLB),所以如果你的芯片组版本有这个,你就赢了! 另外,如果你有一个双向移位器,换挡指令很容易实现? 在这里有一个Lg [N]algorithm来计算最高设置位。

在幻数方面,对于改变表格的大小,ftbl2的幻数是32,但是要注意的是6 (Lg [32] +1)用于移位。

一个常见的方法是平分。

 hi = number lo = 0 mid = ( hi + lo ) / 2 mid2 = mid*mid while( lo < hi-1 and mid2 != number ) { if( mid2 < number ) { lo = mid else hi = mid mid = ( hi + lo ) / 2 mid2 = mid*mid 

类似的东西应该工作得很好。 它做log2(数字)testing,做log2(数字)相乘和除法。 由于除法是除以2,所以可以用>>replace它。

终止条件可能不是正确的,所以一定要testing各种整数,以确保除以2之间的差值不会在两个偶数值之间错误地摆动; 他们相差超过1。

这不是快,但它很小,很简单:

 int isqrt(int n) { int b = 0; while(n >= 0) { n = n - b; b = b + 1; n = n - b; } return b - 1; } 

这取决于sqrt函数的用法。 我经常使用一些约快速版本。 例如,当我需要计算vector模块时:

 Module = SQRT( x^2 + y^2) 

我用 :

 Module = MAX( x,y) + Min(x,y)/2 

其中可以用3或4个指令编码为:

 If (x > y ) Module = x + y >> 1; Else Module = y + x >> 1; 

我发现大多数algorithm都是基于简单的思想,但是以一种更为复杂的方式来实现,而不是必要的。 我已经从这里获得了这个想法: http ://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf(由Ross M. Fosler撰写),并将其变成了一个非常短的C函数:

 uint16_t int_sqrt32(uint32_t x) { uint16_t res=0; uint16_t add= 0x8000; int i; for(i=0;i<16;i++) { uint16_t temp=res | add; uint32_t g2=temp*temp; if (x>=g2) { res=temp; } add>>=1; } return res; } 

这在我的blackfin编译为5个周期/位。 我相信如果你使用for循环而不是while循环,你的编译代码通常会更快,而且你得到了确定性时间的额外好处(尽pipe这在一定程度上取决于你的编译器如何优化if语句)。

我已经解决了类似于这个维基百科文章中描述的二进制数字逐位algorithm的东西。

这是Java中的一个解决scheme,它将整数log_2和牛顿方法结合起来,创build一个无循环algorithm。 作为缺点,它需要分工。 注释行需要上转换为64位algorithm。

 private static final int debruijn= 0x07C4ACDD; //private static final long debruijn= ( ~0x0218A392CD3D5DBFL)>>>6; static { for(int x= 0; x<32; ++x) { final long v= ~( -2L<<(x)); DeBruijnArray[(int)((v*debruijn)>>>27)]= x; //>>>58 } for(int x= 0; x<32; ++x) SQRT[x]= (int) (Math.sqrt((1L<<DeBruijnArray[x])*Math.sqrt(2))); } public static int sqrt(final int num) { int y; if(num==0) return num; { int v= num; v|= v>>>1; // first round up to one less than a power of 2 v|= v>>>2; v|= v>>>4; v|= v>>>8; v|= v>>>16; //v|= v>>>32; y= SQRT[(v*debruijn)>>>27]; //>>>58 } //y= (y+num/y)>>>1; y= (y+num/y)>>>1; y= (y+num/y)>>>1; y= (y+num/y)>>>1; return y*y>num?y-1:y; } 

这是如何工作的:第一部分产生一个精确到约三位的平方根。 行[y =(y + num / y)>> 1;]将位精度加倍。 最后一行消除了可以产生的屋顶根。

我在C#中实现了Warren的build议和牛顿方法,用于64位整数。 Isqrt使用Newton方法,而Isqrt使用Warren的方法。 这里是源代码:

 using System; namespace Cluster { public static class IntegerMath { /// <summary> /// Compute the integer square root, the largest whole number less than or equal to the true square root of N. /// /// This uses the integer version of Newton's method. /// </summary> public static long Isqrt(this long n) { if (n < 0) throw new ArgumentOutOfRangeException("n", "Square root of negative numbers is not defined."); if (n <= 1) return n; var xPrev2 = -2L; var xPrev1 = -1L; var x = 2L; // From Wikipedia: if N + 1 is a perfect square, then the algorithm enters a two-value cycle, so we have to compare // to two previous values to test for convergence. while (x != xPrev2 && x != xPrev1) { xPrev2 = xPrev1; xPrev1 = x; x = (x + n/x)/2; } // The two values x and xPrev1 will be above and below the true square root. Choose the lower one. return x < xPrev1 ? x : xPrev1; } #region Sqrt using Bit-shifting and magic numbers. // From http://stackoverflow.com/questions/1100090/looking-for-an-efficient-integer-square-root-algorithm-for-arm-thumb2 // Converted to C#. private static readonly ulong debruijn= ( ~0x0218A392CD3D5DBFUL)>>6; private static readonly ulong[] SQRT = new ulong[64]; private static readonly int[] DeBruijnArray = new int[64]; static IntegerMath() { for(int x= 0; x<64; ++x) { ulong v= (ulong) ~( -2L<<(x)); DeBruijnArray[(v*debruijn)>>58]= x; } for(int x= 0; x<64; ++x) SQRT[x]= (ulong) (Math.Sqrt((1L<<DeBruijnArray[x])*Math.Sqrt(2))); } public static long Isqrt2(this long n) { ulong num = (ulong) n; ulong y; if(num==0) return (long)num; { ulong v= num; v|= v>>1; // first round up to one less than a power of 2 v|= v>>2; v|= v>>4; v|= v>>8; v|= v>>16; v|= v>>32; y= SQRT[(v*debruijn)>>58]; } y= (y+num/y)>>1; y= (y+num/y)>>1; y= (y+num/y)>>1; y= (y+num/y)>>1; // Make sure that our answer is rounded down, not up. return (long) (y*y>num?y-1:y); } #endregion } } 

我用下面的代码进行基准testing:

 using System; using System.Diagnostics; using Cluster; using Microsoft.VisualStudio.TestTools.UnitTesting; namespace ClusterTests { [TestClass] public class IntegerMathTests { [TestMethod] public void Isqrt_Accuracy() { for (var n = 0L; n <= 100000L; n++) { var expectedRoot = (long) Math.Sqrt(n); var actualRoot = n.Isqrt(); Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n)); } } [TestMethod] public void Isqrt2_Accuracy() { for (var n = 0L; n <= 100000L; n++) { var expectedRoot = (long)Math.Sqrt(n); var actualRoot = n.Isqrt2(); Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n)); } } [TestMethod] public void Isqrt_Speed() { var integerTimer = new Stopwatch(); var libraryTimer = new Stopwatch(); integerTimer.Start(); var total = 0L; for (var n = 0L; n <= 300000L; n++) { var root = n.Isqrt(); total += root; } integerTimer.Stop(); libraryTimer.Start(); total = 0L; for (var n = 0L; n <= 300000L; n++) { var root = (long)Math.Sqrt(n); total += root; } libraryTimer.Stop(); var isqrtMilliseconds = integerTimer.ElapsedMilliseconds; var libraryMilliseconds = libraryTimer.ElapsedMilliseconds; var msg = String.Format("Isqrt: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds); Debug.WriteLine(msg); Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg); } [TestMethod] public void Isqrt2_Speed() { var integerTimer = new Stopwatch(); var libraryTimer = new Stopwatch(); var warmup = (10L).Isqrt2(); integerTimer.Start(); var total = 0L; for (var n = 0L; n <= 300000L; n++) { var root = n.Isqrt2(); total += root; } integerTimer.Stop(); libraryTimer.Start(); total = 0L; for (var n = 0L; n <= 300000L; n++) { var root = (long)Math.Sqrt(n); total += root; } libraryTimer.Stop(); var isqrtMilliseconds = integerTimer.ElapsedMilliseconds; var libraryMilliseconds = libraryTimer.ElapsedMilliseconds; var msg = String.Format("isqrt2: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds); Debug.WriteLine(msg); Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg); } } } 

戴尔Latitude E6540处于发行模式,Visual Studio 2012的结果是图书馆调用Math.Sqrt的速度更快。

  • 59毫秒 – 牛顿(Isqrt)
  • 12 ms – 位移(Isqrt2)
  • 5毫秒 – Math.Sqrt

我对编译器指令并不是很聪明,因此可以调整编译器以更快地获得整数运算。 显然,这种转移方法与图书馆非常接近。 在没有math协处理器的系统上,速度会非常快。

这种方法类似于长时间的分割:你为根的下一个数字构build一个猜测,做一个减法,如果差别符合一定的标准,input数字。 对于二进制版本,下一个数字的唯一select是0或1,所以你总是猜测1,做减法,input1,除非差别是负的。

http://www.realitypixels.com/turk/opensource/index.html#FractSqrt