Fastest sort of fixed length 6 int array

Answering another Stack Overflow question (this one) I stumbled upon an interesting sub-problem: what is the fastest way to sort an array of 6 integers?

As the question is very low level:

  • we can't assume libraries are available (and the call itself has its cost), only plain C
  • to avoid emptying the instruction pipeline (which has a very high cost), we should probably minimize branches, jumps, and every other kind of control flow breaking (like the ones hidden behind sequence points in && or ||)
  • room is constrained and minimizing registers and memory use is an issue; ideally an in-place sort is probably best.

Really this question is a kind of golf where the goal is not to minimize source length but execution time. I call it "Zening" code, after the title of the book Zen of Code Optimization by Michael Abrash and its sequels.

As for why it is interesting, there are several layers:

  • the example is simple, easy to understand and measure, and doesn't involve much C skill
  • it shows the effect of choosing a good algorithm for the problem, but also the effect of the compiler and the underlying hardware.

Here is my reference (naive, not optimized) implementation and my test set.

 #include <stdio.h>

 static __inline__ int sort6(int * d){
     char j, i, imin;
     int tmp;
     for (j = 0 ; j < 5 ; j++){
         imin = j;
         for (i = j + 1; i < 6 ; i++){
             if (d[i] < d[imin]){
                 imin = i;
             }
         }
         tmp = d[j];
         d[j] = d[imin];
         d[imin] = tmp;
     }
 }

 static __inline__ unsigned long long rdtsc(void)
 {
     unsigned long long int x;
     __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
     return x;
 }

 int main(int argc, char ** argv){
     int i;
     int d[6][5] = {
         {1, 2, 3, 4, 5, 6},
         {6, 5, 4, 3, 2, 1},
         {100, 2, 300, 4, 500, 6},
         {100, 2, 3, 4, 500, 6},
         {1, 200, 3, 4, 5, 600},
         {1, 1, 2, 1, 2, 1}
     };

     unsigned long long cycles = rdtsc();
     for (i = 0; i < 6 ; i++){
         sort6(d[i]);
         /*
          * printf("d%d : %d %d %d %d %d %d\n", i,
          *  d[i][0], d[i][6], d[i][7],
          *  d[i][8], d[i][9], d[i][10]);
          */
     }
     cycles = rdtsc() - cycles;
     printf("Time is %d\n", (unsigned)cycles);
 }

Initial results

As the number of variants is getting large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those shown above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested in the behavior on different target architectures/compilers. (OK guys, put it in answers, I will +1 every contributor of a new set of results.)

I gave the answer to Daniel Stutzbach (for golfing) one year ago, as he was at that time the source of the fastest solution (sorting networks).

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2

  • Direct call to qsort library function: 689.38
  • Naive implementation (insertion sort): 285.70
  • Insertion Sort (Daniel Stutzbach): 142.12
  • Insertion Sort Unrolled: 125.47
  • Rank Order: 102.26
  • Rank Order with registers: 58.03
  • Sorting Networks (Daniel Stutzbach): 111.68
  • Sorting Networks (Paul R): 66.36
  • Sorting Networks 12 with Fast Swap: 58.86
  • Sorting Networks 12 reordered Swap: 53.74
  • Sorting Networks 12 reordered Simple Swap: 31.54
  • Reordered Sorting Network w/ fast swap: 31.54
  • Reordered Sorting Network w/ fast swap V2: 33.63
  • Inlined Bubble Sort (Paolo Bonzini): 48.85
  • Unrolled Insertion Sort (Paolo Bonzini): 75.30

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1

  • Direct call to qsort library function: 705.93
  • Naive implementation (insertion sort): 135.60
  • Insertion Sort (Daniel Stutzbach): 142.11
  • Insertion Sort Unrolled: 126.75
  • Rank Order: 46.42
  • Rank Order with registers: 43.58
  • Sorting Networks (Daniel Stutzbach): 115.57
  • Sorting Networks (Paul R): 64.44
  • Sorting Networks 12 with Fast Swap: 61.98
  • Sorting Networks 12 reordered Swap: 54.67
  • Sorting Networks 12 reordered Simple Swap: 31.54
  • Reordered Sorting Network w/ fast swap: 31.24
  • Reordered Sorting Network w/ fast swap V2: 33.07
  • Inlined Bubble Sort (Paolo Bonzini): 45.79
  • Unrolled Insertion Sort (Paolo Bonzini): 80.15

I included both -O1 and -O2 results because, surprisingly, for several programs -O2 is less efficient than -O1. I wonder which specific optimization has this effect.

Proposed solutions

Insertion Sort (Daniel Stutzbach)

As expected, minimizing branches is indeed a good idea.

Sorting Networks (Daniel Stutzbach)

Better than insertion sort. I wondered whether the main effect was not obtained simply from avoiding the external loop. I tried to check by unrolling the insertion sort, and indeed we get roughly the same figures (code is here).

Sorting Networks (Paul R)

The best so far. The actual code I used to test is here. I don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing? Fast max?

Sorting Networks 12 with Fast Swap

As suggested by Daniel Stutzbach, I combined his 12-swap sorting network with a branchless fast swap (code is here). It is indeed faster, the best so far by a small margin (roughly 5%), as could be expected using one less swap.
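For reference, a branchless compare-and-exchange in this spirit can be built on the classic XOR min/max trick. The following is only a sketch (the exact macro used in the linked test suite may differ) and it assumes two's-complement ints:

 /* Branchless compare-and-exchange sketch: no conditional jump.
    Relies on (d[y] < d[x]) evaluating to 0 or 1 and on two's-complement
    negation producing an all-ones mask. */
 #define SWAP(x, y) do {                              \
     int delta = (d[x] ^ d[y]) & -(d[y] < d[x]);      \
     d[x] ^= delta;  /* d[x] becomes the minimum */   \
     d[y] ^= delta;  /* d[y] becomes the maximum */   \
 } while (0)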

It is also interesting to notice that the branchless swap seems to be much less efficient (about 4 times) than the simple swap on a PPC architecture.

Calling library qsort

To give another reference point I also tried, as suggested, just calling library qsort (code is here). As expected it is much slower: 10 to 30 times slower... As became apparent with the new test suite, the main problem seems to be the initial load of the library after the first call, so it does not compare so poorly with the other versions. It is just between 3 and 20 times slower on my Linux. On some architectures used for tests by others it even seems to be faster (I'm really surprised by that one, as library qsort uses a more complex API).
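For reference, a minimal sketch of what the qsort-based variant presumably looks like (the comparator name is illustrative, not taken from the linked code):

 #include <stdlib.h>

 /* Comparator written as (a > b) - (a < b) to avoid the overflow
    a plain subtraction could have on extreme values. */
 static int cmp_int(const void *pa, const void *pb)
 {
     int a = *(const int *)pa, b = *(const int *)pb;
     return (a > b) - (a < b);
 }

 static void sort6_libqsort(int *d)
 {
     qsort(d, 6, sizeof(int), cmp_int);
 }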

Rank Order

Rex Kerr proposed another completely different method: for each item of the array, compute its final position directly. This is efficient because computing rank order needs no branching. The drawback of this method is that it takes three times the amount of memory of the array (one copy of the array plus variables to store the rank orders). The performance results are very surprising (and interesting). On my reference architecture with a 32-bit OS and an Intel Core2 Quad E8300, the cycle count was slightly below 1000 (like the sorting networks with branching swap). But when compiled and executed on my 64-bit box (Intel Core2 Duo) it performed much better: it became the fastest so far. I finally found out the true reason. My 32-bit box uses gcc 4.4.1 and my 64-bit box gcc 4.4.3, and the latter seems to be much better at optimizing this particular code (there was very little difference for the other proposals).

Update

As the figures published above show, this effect was still enhanced by later versions of gcc, and Rank Order became twice as fast as any other alternative.

Sorting Networks 12 with reordered Swap

The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder: how could a program with 3 times as much memory usage be faster than the branchless sorting networks? My hypothesis was that it had fewer read-after-write dependencies, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder the swaps to minimize read-after-write dependencies. Put more simply: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to finish before performing the second one, because both access a common memory cell. When you do SWAP(1, 2); SWAP(4, 5); the processor can execute both in parallel. I tried it and it works as expected: the sorting networks run about 10% faster.
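For illustration, here is a sketch of such a reordering: the same 12 compare-exchanges, grouped so that consecutive SWAPs touch disjoint elements wherever possible (this is the ordering that also appears in the "best" code further down; the SWAP body shown here is just a placeholder):

 static inline void sort6_reordered_sketch(int *d)
 {
 /* Placeholder compare-exchange; any of the SWAP variants discussed here fits. */
 #define SWAP(x, y) { int lo = d[x] < d[y] ? d[x] : d[y]; \
                      d[y] = d[x] < d[y] ? d[y] : d[x];   \
                      d[x] = lo; }
     SWAP(1, 2); SWAP(4, 5);               /* independent pair */
     SWAP(0, 2); SWAP(3, 5);               /* independent pair */
     SWAP(0, 1); SWAP(3, 4);               /* independent pair */
     SWAP(1, 4); SWAP(0, 3); SWAP(2, 5);   /* three independent exchanges */
     SWAP(1, 3); SWAP(2, 4);               /* independent pair */
     SWAP(2, 3);
 #undef SWAP
 }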

Sorting Networks 12 with simple Swap

One year after the original post, Steinar H. Gunderson suggested that we should not try to outsmart the compiler and keep the swap code simple. It is indeed a good idea, as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly that can spare some more cycles. The most surprising thing (it says volumes about programmer psychology) is that one year ago nobody had used that version of swap. The code I used to test is here. Others suggested other ways to write a C fast swap, but it yields the same performance as the simple one with a decent compiler.

The "best" code is now as follows:

 static inline void sort6_sorting_network_simple_swap(int * d){
 #define min(x, y) (x<y?x:y)
 #define max(x, y) (x<y?y:x)
 #define SWAP(x,y) { const int a = min(d[x], d[y]); \
                     const int b = max(d[x], d[y]); \
                     d[x] = a; d[y] = b; }
     SWAP(1, 2);
     SWAP(4, 5);
     SWAP(0, 2);
     SWAP(3, 5);
     SWAP(0, 1);
     SWAP(3, 4);
     SWAP(1, 4);
     SWAP(0, 3);
     SWAP(2, 5);
     SWAP(1, 3);
     SWAP(2, 4);
     SWAP(2, 3);
 #undef SWAP
 #undef min
 #undef max
 }

If we believe our test set (and, yes, it is quite poor, its mere benefit being that it is short, simple and it is easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That puts each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible?

For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.

Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted data, so it will be the better choice if there's an above-average chance of almost-sorted data.

The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.

Here's an insertion sort implementation:

 static __inline__ int sort6(int *d){
     int i, j;
     for (i = 1; i < 6; i++) {
         int tmp = d[i];
         for (j = i; j >= 1 && tmp < d[j-1]; j--)
             d[j] = d[j-1];
         d[j] = tmp;
     }
 }

Here's how I would build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:

 static __inline__ int sort6(int * d){
 #define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
     SWAP(1, 2);
     SWAP(0, 2);
     SWAP(0, 1);
     SWAP(4, 5);
     SWAP(3, 5);
     SWAP(3, 4);
     SWAP(0, 3);
     SWAP(1, 4);
     SWAP(2, 5);
     SWAP(2, 4);
     SWAP(1, 3);
     SWAP(2, 3);
 #undef SWAP
 }

Here's an implementation using sorting networks:

 inline void Sort2(int *p0, int *p1)
 {
     const int temp = min(*p0, *p1);
     *p1 = max(*p0, *p1);
     *p0 = temp;
 }

 inline void Sort3(int *p0, int *p1, int *p2)
 {
     Sort2(p0, p1);
     Sort2(p1, p2);
     Sort2(p0, p1);
 }

 inline void Sort4(int *p0, int *p1, int *p2, int *p3)
 {
     Sort2(p0, p1);
     Sort2(p2, p3);
     Sort2(p0, p2);
     Sort2(p1, p3);
     Sort2(p1, p2);
 }

 inline void Sort6(int *p0, int *p1, int *p2, int *p3, int *p4, int *p5)
 {
     Sort3(p0, p1, p2);
     Sort3(p3, p4, p5);
     Sort2(p0, p3);
     Sort2(p2, p5);
     Sort4(p1, p2, p3, p4);
 }

You really need very efficient branchless min and max implementations for this, since that is effectively what this code boils down to - a sequence of min and max operations (13 of each in total). I leave this as an exercise for the reader.
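For what it's worth, one possible branchless formulation is the XOR/mask trick below. This is only a sketch: it assumes two's-complement ints and a 0/1 comparison result, and on modern compilers a plain ternary often compiles to cmov or SSE min/max anyway, so measure both:

 /* Branchless min/max sketch: -(x < y) is 0 when x >= y and an
    all-ones mask when x < y, selecting between the two operands. */
 static inline int min_nb(int x, int y) { return y ^ ((x ^ y) & -(x < y)); }
 static inline int max_nb(int x, int y) { return x ^ ((x ^ y) & -(x < y)); }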

Note that an implementation like this naturally lends itself to vectorization (e.g. SIMD - most SIMD ISAs have vector min/max instructions) and also to GPU implementations (e.g. CUDA - being branchless there are no problems with warp divergence etc.).
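As a small illustration of the SIMD point (not part of the original answer): with SSE4.1, PMINSD/PMAXSD provide a branchless compare-exchange on four independent pairs of 32-bit integers at once:

 #include <smmintrin.h>  /* SSE4.1: _mm_min_epi32 / _mm_max_epi32 */

 /* Compare-exchange four independent (a, b) pairs in parallel: afterwards
    every lane of *a holds the smaller value and every lane of *b the larger. */
 static inline void sort2_x4(__m128i *a, __m128i *b)
 {
     __m128i lo = _mm_min_epi32(*a, *b);
     __m128i hi = _mm_max_epi32(*a, *b);
     *a = lo;
     *b = hi;
 }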

See also: Fast algorithm implementation to sort very small lists

Since these are integers and compares are fast, why not compute the rank order of each directly:

 #include <string.h> /* for memcpy */

 inline void sort6(int *d) {
     int e[6];
     memcpy(e, d, 6*sizeof(int));
     int o0 = (d[0]>d[1]) +(d[0]>d[2]) +(d[0]>d[3]) +(d[0]>d[4]) +(d[0]>d[5]);
     int o1 = (d[1]>=d[0])+(d[1]>d[2]) +(d[1]>d[3]) +(d[1]>d[4]) +(d[1]>d[5]);
     int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3]) +(d[2]>d[4]) +(d[2]>d[5]);
     int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4]) +(d[3]>d[5]);
     int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
     int o5 = 15-(o0+o1+o2+o3+o4);
     d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
 }

Looks like I got to the party a year late, but here we go...

Looking at the assembly generated by gcc 4.5.2, I observed that loads and stores are being done for every swap, which really isn't needed. It would be better to load the 6 values into registers, sort those, and store them back into memory. I ordered the loads from memory to happen as close as possible to where the values are first needed, and the stores back as soon as the registers are last used. I also used Steinar H. Gunderson's SWAP macro. Update: I switched to Paolo Bonzini's SWAP macro, which gcc converts into something similar to Gunderson's, but gcc is able to order the instructions better since they aren't given as explicit assembly.

I used the same swap order as the reordered swap network given as best performing, although there may be a better ordering. If I find some more time I'll generate and test a bunch of permutations.

I changed the test code to consider over 4000 arrays and to show the average number of cycles needed per sort. On an i5-650 I get ~34.1 cycles/sort (using -O3), compared to ~65.3 cycles/sort for the original reordered sorting network (using -O1, which beats -O2 and -O3).

 #include <stdio.h> static inline void sort6_fast(int * d) { #define SWAP(x,y) { int dx = x, dy = y, tmp; tmp = x = dx < dy ? dx : dy; y ^= dx ^ tmp; } register int x0,x1,x2,x3,x4,x5; x1 = d[1]; x2 = d[2]; SWAP(x1, x2); x4 = d[4]; x5 = d[5]; SWAP(x4, x5); x0 = d[0]; SWAP(x0, x2); x3 = d[3]; SWAP(x3, x5); SWAP(x0, x1); SWAP(x3, x4); SWAP(x1, x4); SWAP(x0, x3); d[0] = x0; SWAP(x2, x5); d[5] = x5; SWAP(x1, x3); d[1] = x1; SWAP(x2, x4); d[4] = x4; SWAP(x2, x3); d[2] = x2; d[3] = x3; #undef SWAP #undef min #undef max } static __inline__ unsigned long long rdtsc(void) { unsigned long long int x; __asm__ volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx"); return x; } void ran_fill(int n, int *a) { static int seed = 76521; while (n--) *a++ = (seed = seed *1812433253 + 12345); } #define NTESTS 4096 int main() { int i; int d[6*NTESTS]; ran_fill(6*NTESTS, d); unsigned long long cycles = rdtsc(); for (i = 0; i < 6*NTESTS ; i+=6) { sort6_fast(d+i); } cycles = rdtsc() - cycles; printf("Time is %.2lf\n", (double)cycles/(double)NTESTS); for (i = 0; i < 6*NTESTS ; i+=6) { if (d[i+0] > d[i+1] || d[i+1] > d[i+2] || d[i+2] > d[i+3] || d[i+3] > d[i+4] || d[i+4] > d[i+5]) printf("d%d : %d %d %d %d %d %d\n", i, d[i+0], d[i+1], d[i+2], d[i+3], d[i+4], d[i+5]); } return 0; } 

I changed the test suite to also report clocks per sort and to run more tests (the cmp function was updated to handle integer overflow as well); here are the results on some different architectures. I attempted testing on an AMD CPU, but rdtsc isn't reliable on the X6 1100T I have available.

 Clarkdale (i5-650)
 ==================
 Direct call to qsort library function      635.14   575.65   581.61   577.76   521.12
 Naive implementation (insertion sort)      538.30   135.36   134.89   240.62   101.23
 Insertion Sort (Daniel Stutzbach)          424.48   159.85   160.76   152.01   151.92
 Insertion Sort Unrolled                    339.16   125.16   125.81   129.93   123.16
 Rank Order                                 184.34   106.58    54.74    93.24    94.09
 Rank Order with registers                  127.45   104.65    53.79    98.05    97.95
 Sorting Networks (Daniel Stutzbach)        269.77   130.56   128.15   126.70   127.30
 Sorting Networks (Paul R)                  551.64   103.20    64.57    73.68    73.51
 Sorting Networks 12 with Fast Swap         321.74    61.61    63.90    67.92    67.76
 Sorting Networks 12 reordered Swap         318.75    60.69    65.90    70.25    70.06
 Reordered Sorting Network w/ fast swap     145.91    34.17    32.66    32.22    32.18

 Kentsfield (Core 2 Quad)
 ========================
 Direct call to qsort library function      870.01   736.39   723.39   725.48   721.85
 Naive implementation (insertion sort)      503.67   174.09   182.13   284.41   191.10
 Insertion Sort (Daniel Stutzbach)          345.32   152.84   157.67   151.23   150.96
 Insertion Sort Unrolled                    316.20   133.03   129.86   118.96   105.06
 Rank Order                                 164.37   138.32    46.29    99.87    99.81
 Rank Order with registers                  115.44   116.02    44.04   116.04   116.03
 Sorting Networks (Daniel Stutzbach)        230.35   114.31   119.15   110.51   111.45
 Sorting Networks (Paul R)                  498.94    77.24    63.98    62.17    65.67
 Sorting Networks 12 with Fast Swap         315.98    59.41    58.36    60.29    55.15
 Sorting Networks 12 reordered Swap         307.67    55.78    51.48    51.67    50.74
 Reordered Sorting Network w/ fast swap     149.68    31.46    30.91    31.54    31.58

 Sandy Bridge (i7-2600k)
 =======================
 Direct call to qsort library function      559.97   451.88   464.84   491.35   458.11
 Naive implementation (insertion sort)      341.15   160.26   160.45   154.40   106.54
 Insertion Sort (Daniel Stutzbach)          284.17   136.74   132.69   123.85   121.77
 Insertion Sort Unrolled                    239.40   110.49   114.81   110.79   117.30
 Rank Order                                 114.24    76.42    45.31    36.96    36.73
 Rank Order with registers                  105.09    32.31    48.54    32.51    33.29
 Sorting Networks (Daniel Stutzbach)        210.56   115.68   116.69   107.05   124.08
 Sorting Networks (Paul R)                  364.03    66.02    61.64    45.70    44.19
 Sorting Networks 12 with Fast Swap         246.97    41.36    59.03    41.66    38.98
 Sorting Networks 12 reordered Swap         235.39    38.84    47.36    38.61    37.29
 Reordered Sorting Network w/ fast swap     115.58    27.23    27.75    27.25    26.54

 Nehalem (Xeon E5640)
 ====================
 Direct call to qsort library function      911.62   890.88   681.80   876.03   872.89
 Naive implementation (insertion sort)      457.69   236.87   127.68   388.74   175.28
 Insertion Sort (Daniel Stutzbach)          317.89   279.74   147.78   247.97   245.09
 Insertion Sort Unrolled                    259.63   220.60   116.55   221.66   212.93
 Rank Order                                 140.62   197.04    52.10   163.66   153.63
 Rank Order with registers                   84.83    96.78    50.93   109.96    54.73
 Sorting Networks (Daniel Stutzbach)        214.59   220.94   118.68   120.60   116.09
 Sorting Networks (Paul R)                  459.17   163.76    56.40    61.83    58.69
 Sorting Networks 12 with Fast Swap         284.58    95.01    50.66    53.19    55.47
 Sorting Networks 12 reordered Swap         281.20    96.72    44.15    56.38    54.57
 Reordered Sorting Network w/ fast swap     128.34    50.87    26.87    27.91    28.02

The test code is pretty bad; it overflows the initial array (don't people here read compiler warnings?), the printf is printing out the wrong elements, it uses .byte for rdtsc for no good reason, there's only one run (!), there's nothing checking that the end results are actually correct (so it's very easy to "optimize" into something subtly wrong), the included tests are very rudimentary (no negative numbers), and there's nothing to stop the compiler from just discarding the whole function as dead code.

That being said, it's also pretty easy to improve on the sorting network solutions. Simply change the min/max/SWAP stuff to

 #define SWAP(x,y) { int tmp; asm("mov %0, %2 ; cmp %1, %0 ; cmovg %1, %0 ; cmovg %2, %1" : "=r" (d[x]), "=r" (d[y]), "=r" (tmp) : "0" (d[x]), "1" (d[y]) : "cc"); } 

and it comes out about 65% faster (Debian gcc 4.4.5 with -O2, amd64, Core i7).

I stumbled onto this question from Google a few days ago because I also had a need to quickly sort a fixed-length array of 6 integers. In my case, however, my integers are only 8 bits (rather than 32) and I do not have a strict requirement of only using C. I thought I would share my findings anyway, in case they might be helpful to someone.

I implemented a variant of a network sort in assembly that uses SSE to vectorize the compare and swap operations as much as possible. It takes six "passes" to completely sort the array. I used a novel mechanism to convert the results of PCMPGTB (vectorized compare) directly into shuffle parameters for PSHUFB (vectorized swap), using only a PADDB (vectorized add) and, in some cases, also a PAND (bitwise AND) instruction.

This approach also had the side effect of yielding a truly branchless function. There are no jump instructions whatsoever.

It appears that this implementation is about 38% faster than the implementation currently marked as the fastest option in the question ("Sorting Networks 12 with Simple Swap"). I modified that implementation to use char array elements during my testing, to make the comparison fair.

I should note that this approach can be applied to any array size of up to 16 elements. I expect the relative speed advantage over the alternatives to grow larger as the array size increases.

The code is written in MASM for x86-64 processors with SSSE3. The function uses the "new" Windows x64 calling convention. Here it is...

 PUBLIC simd_sort_6 .DATA ALIGN 16 pass1_shuffle OWORD 0F0E0D0C0B0A09080706040503010200h pass1_add OWORD 0F0E0D0C0B0A09080706050503020200h pass2_shuffle OWORD 0F0E0D0C0B0A09080706030405000102h pass2_and OWORD 00000000000000000000FE00FEFE00FEh pass2_add OWORD 0F0E0D0C0B0A09080706050405020102h pass3_shuffle OWORD 0F0E0D0C0B0A09080706020304050001h pass3_and OWORD 00000000000000000000FDFFFFFDFFFFh pass3_add OWORD 0F0E0D0C0B0A09080706050404050101h pass4_shuffle OWORD 0F0E0D0C0B0A09080706050100020403h pass4_and OWORD 0000000000000000000000FDFD00FDFDh pass4_add OWORD 0F0E0D0C0B0A09080706050403020403h pass5_shuffle OWORD 0F0E0D0C0B0A09080706050201040300h pass5_and OWORD 0000000000000000000000FEFEFEFE00h pass5_add OWORD 0F0E0D0C0B0A09080706050403040300h pass6_shuffle OWORD 0F0E0D0C0B0A09080706050402030100h pass6_add OWORD 0F0E0D0C0B0A09080706050403030100h .CODE simd_sort_6 PROC FRAME .endprolog ; pxor xmm4, xmm4 ; pinsrd xmm4, dword ptr [rcx], 0 ; pinsrb xmm4, byte ptr [rcx + 4], 4 ; pinsrb xmm4, byte ptr [rcx + 5], 5 ; The benchmarked 38% faster mentioned in the text was with the above slower sequence that tied up the shuffle port longer. Same on extract ; avoiding pins/extrb also means we don't need SSE 4.1, but SSSE3 CPUs without SSE4.1 (eg Conroe/Merom) have slow pshufb. movd xmm4, dword ptr [rcx] pinsrw xmm4, word ptr [rcx + 4], 2 ; word 2 = bytes 4 and 5 movdqa xmm5, xmm4 pshufb xmm5, oword ptr [pass1_shuffle] pcmpgtb xmm5, xmm4 paddb xmm5, oword ptr [pass1_add] pshufb xmm4, xmm5 movdqa xmm5, xmm4 pshufb xmm5, oword ptr [pass2_shuffle] pcmpgtb xmm5, xmm4 pand xmm5, oword ptr [pass2_and] paddb xmm5, oword ptr [pass2_add] pshufb xmm4, xmm5 movdqa xmm5, xmm4 pshufb xmm5, oword ptr [pass3_shuffle] pcmpgtb xmm5, xmm4 pand xmm5, oword ptr [pass3_and] paddb xmm5, oword ptr [pass3_add] pshufb xmm4, xmm5 movdqa xmm5, xmm4 pshufb xmm5, oword ptr [pass4_shuffle] pcmpgtb xmm5, xmm4 pand xmm5, oword ptr [pass4_and] paddb xmm5, oword ptr [pass4_add] pshufb xmm4, xmm5 movdqa xmm5, xmm4 pshufb xmm5, oword ptr [pass5_shuffle] pcmpgtb xmm5, xmm4 pand xmm5, oword ptr [pass5_and] paddb xmm5, oword ptr [pass5_add] pshufb xmm4, xmm5 movdqa xmm5, xmm4 pshufb xmm5, oword ptr [pass6_shuffle] pcmpgtb xmm5, xmm4 paddb xmm5, oword ptr [pass6_add] pshufb xmm4, xmm5 ;pextrd dword ptr [rcx], xmm4, 0 ; benchmarked with this ;pextrb byte ptr [rcx + 4], xmm4, 4 ; slower version ;pextrb byte ptr [rcx + 5], xmm4, 5 movd dword ptr [rcx], xmm4 pextrw word ptr [rcx + 4], xmm4, 2 ; x86 is little-endian, so this is the right order ret simd_sort_6 ENDP END 

You can compile it to an object file and link it into your C project. For instructions on how to do this in Visual Studio, you can read this article. You can call the function from your C code using the following prototype:

 void simd_sort_6(char *values); 

While I really like the provided swap macro:

 #define min(x, y) (y ^ ((x ^ y) & -(x < y)))
 #define max(x, y) (x ^ ((x ^ y) & -(x < y)))
 #define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }

I see an improvement (which a good compiler might make anyway):

 #define SWAP(x,y) { int tmp = ((x ^ y) & -(y < x)); y ^= tmp; x ^= tmp; } 

We take note of how min and max work and pull the common sub-expression out explicitly. This eliminates the min and max macros completely.

Never optimize min/max without benchmarking. If I let GCC optimize min with conditional move instructions I get a 33% speedup:

 #define SWAP(x,y) { int dx = d[x], dy = d[y], tmp; tmp = d[x] = dx < dy ? dx : dy; d[y] ^= dx ^ tmp; } 

(280 vs. 420 cycles in the test code). Doing max with ?: is more or less the same, almost lost in the noise, but the version above is a little bit faster. This swap is faster with both GCC and Clang.

Compilers are also doing an exceptional job at register allocation and alias analysis, effectively moving d[x] into local variables early on and only copying back to memory at the end. In fact, they do so even better than if you worked entirely with local variables (like d0 = d[0], d1 = d[1], d2 = d[2], d3 = d[3], d4 = d[4], d5 = d[5] ). I'm writing this because you are assuming strong optimization and yet trying to outsmart the compiler on min/max. 🙂

By the way, I tried Clang and GCC. They do the same optimization, but due to scheduling differences the two have some variation in the results, can't say really which is faster or slower. GCC is faster on the sorting networks, Clang on the quadratic sorts.

Just for completeness, unrolled bubble sort and insertion sorts are possible too. Here is the bubble sort:

 SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(4,5); SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(0,1); SWAP(1,2); SWAP(0,1); 

and here is the insertion sort:

 //#define ITER(x) { if (t < d[x]) { d[x+1] = d[x]; d[x] = t; } }
 //Faster on x86, probably slower on ARM or similar:
 #define ITER(x) { d[x+1] ^= t < d[x] ? d[x] ^ d[x+1] : 0; d[x] = t < d[x] ? t : d[x]; }
 static inline void sort6_insertion_sort_unrolled_v2(int * d){
     int t;
     t = d[1]; ITER(0);
     t = d[2]; ITER(1); ITER(0);
     t = d[3]; ITER(2); ITER(1); ITER(0);
     t = d[4]; ITER(3); ITER(2); ITER(1); ITER(0);
     t = d[5]; ITER(4); ITER(3); ITER(2); ITER(1); ITER(0);
 }

This insertion sort is faster than Daniel Stutzbach's, and is especially good on a GPU or a computer with predication because ITER can be done with only 3 instructions (vs. 4 for SWAP). For example, here is the t = d[2]; ITER(1); ITER(0); line in ARM assembly:

     MOV r6, r2
     CMP r6, r1
     MOVLT r2, r1
     MOVLT r1, r6
     CMP r6, r0
     MOVLT r1, r0
     MOVLT r0, r6

For six elements the insertion sort is competitive with the sorting network (12 swaps vs. 15 iterations balances 4 instructions/swap vs. 3 instructions/iteration); bubble sort of course is slower. But it's not going to be true when the size grows, since insertion sort is O(n^2) while sorting networks are O(n log n).

I ported the test suite to a PPC architecture machine I can not identify (didn't have to touch code, just increase the iterations of the test, use 8 test cases to avoid polluting results with mods and replace the x86 specific rdtsc):

Direct call to qsort library function : 101

Naive implementation (insertion sort) : 299

Insertion Sort (Daniel Stutzbach) : 108

Insertion Sort Unrolled : 51

Sorting Networks (Daniel Stutzbach) : 26

Sorting Networks (Paul R) : 85

Sorting Networks 12 with Fast Swap : 117

Sorting Networks 12 reordered Swap : 116

Rank Order : 56

An XOR swap may be useful in your swapping functions.

 void xorSwap (int *x, int *y) {
     if (*x != *y) {
         *x ^= *y;
         *y ^= *x;
         *x ^= *y;
     }
 }

The if may cause too much divergence in your code, but if you have a guarantee that all your ints are unique this could be handy.

Looking forward to trying my hand at this and learning from these examples, but first some timings from my 1.5 GHz PPC Powerbook G4 w/ 1 GB DDR RAM. (I borrowed a similar rdtsc-like timer for PPC from http://www.mcs.anl.gov/~kazutomo/rdtsc.html for the timings.) I ran the program a few times and the absolute results varied but the consistently fastest test was "Insertion Sort (Daniel Stutzbach)", with "Insertion Sort Unrolled" a close second.

Here's the last set of times:

 **Direct call to qsort library function** : 164
 **Naive implementation (insertion sort)** : 138
 **Insertion Sort (Daniel Stutzbach)** : 85
 **Insertion Sort Unrolled** : 97
 **Sorting Networks (Daniel Stutzbach)** : 457
 **Sorting Networks (Paul R)** : 179
 **Sorting Networks 12 with Fast Swap** : 238
 **Sorting Networks 12 reordered Swap** : 236
 **Rank Order** : 116

Here is my contribution to this thread: an optimized 1, 4 gap shellsort for a 6-member int vector (valp) containing unique values.

 void shellsort (int *valp) { int c,a,*cp,*ip=valp,*ep=valp+5; c=*valp; a=*(valp+4);if (c>a) {*valp= a;*(valp+4)=c;} c=*(valp+1);a=*(valp+5);if (c>a) {*(valp+1)=a;*(valp+5)=c;} cp=ip; do { c=*cp; a=*(cp+1); do { if (c<a) break; *cp=a; *(cp+1)=c; cp-=1; c=*cp; } while (cp>=valp); ip+=1; cp=ip; } while (ip<ep); } 

On my HP dv7-3010so laptop with a dual-core Athlon M300 @ 2 Ghz (DDR2 memory) it executes in 165 clock cycles. This is an average calculated from timing every unique sequence (6!/720 in all). Compiled to Win32 using OpenWatcom 1.8. The loop is essentially an insertion sort and is 16 instructions/37 bytes long.

I do not have a 64-bit environment to compile on.

If insertion sort is reasonably competitive here, I would recommend trying a shellsort. I'm afraid 6 elements is probably just too little for it to be among the best, but it might be worth a try.

Example code, untested, undebugged, etc. You want to tune the inc = 4 and inc -= 3 sequence to find the optimum (try inc = 2, inc -= 1 for example).

 static __inline__ int sort6(int * d)
 {
     int j, i, inc;
     int tmp;
     for (inc = 4; inc > 0; inc -= 3) {
         for (i = inc; i < 6; i++) {
             tmp = d[i];
             j = i;
             while (j >= inc && d[j - inc] > tmp) {
                 d[j] = d[j - inc];
                 j -= inc;
             }
             d[j] = tmp;
         }
     }
 }

I don't think this will win, but if someone posts a question about sorting 10 elements, who knows…

According to Wikipedia this can even be combined with sorting networks: Pratt, V (1979). Shellsort and sorting networks (Outstanding dissertations in the computer sciences). Garland. ISBN 0-824-04406-1

This question is becoming quite old, but I actually had to solve the same problem these days: fast algorithms to sort small arrays. I thought it would be a good idea to share my knowledge. While I first started by using sorting networks, I finally managed to find other algorithms for which the total number of comparisons performed to sort every permutation of 6 values was smaller than with sorting networks, and smaller than with insertion sort. I didn't count the number of swaps; I would expect it to be roughly equivalent (maybe a bit higher sometimes).

The algorithm sort6 uses the algorithm sort4 which uses the algorithm sort3 . Here is the implementation in some light C++ form (the original is template-heavy so that it can work with any random-access iterator and any suitable comparison function).

Sorting 3 values

The following algorithm is an unrolled insertion sort. When two swaps (6 assignments) have to be performed, it uses 4 assignments instead:

 void sort3(int* array) { if (array[1] < array[0]) { if (array[2] < array[0]) { if (array[2] < array[1]) { std::swap(array[0], array[2]); } else { int tmp = array[0]; array[0] = array[1]; array[1] = array[2]; array[2] = tmp; } } else { std::swap(array[0], array[1]); } } else { if (array[2] < array[1]) { if (array[2] < array[0]) { int tmp = array[2]; array[2] = array[1]; array[1] = array[0]; array[0] = tmp; } else { std::swap(array[1], array[2]); } } } } 

It looks a bit complex because the sort has more or less one branch for every possible permutation of the array, using 2~3 comparisons and at most 4 assignments to sort the three values.

Sorting 4 values

This one calls sort3 then performs an unrolled insertion sort with the last element of the array:

 void sort4(int* array) { // Sort the first 3 elements sort3(array); // Insert the 4th element with insertion sort if (array[3] < array[2]) { std::swap(array[2], array[3]); if (array[2] < array[1]) { std::swap(array[1], array[2]); if (array[1] < array[0]) { std::swap(array[0], array[1]); } } } } 

This algorithm performs 3 to 6 comparisons and at most 5 swaps. It is easy to unroll an insertion sort, but we will be using another algorithm for the last sort…

Sorting 6 values

This one uses an unrolled version of what I call a double insertion sort. The name isn't that great, but it's quite descriptive; here is how it works:

  • Sort everything but the first and the last elements of the array.
  • Swap the first and the last elements of the array if the first is greater than the last.
  • Insert the first element into the sorted sequence from the front then the last element from the back.

After the swap, the first element is always smaller than the last, which means that, when inserting them into the sorted sequence, there won't be more than N comparisons to insert the two elements in the worst case: for example, if the first element has been inserted into the 3rd position, then the last one can't be inserted lower than the 4th position.

 void sort6(int* array) { // Sort everything but first and last elements sort4(array+1); // Switch first and last elements if needed if (array[5] < array[0]) { std::swap(array[0], array[5]); } // Insert first element from the front if (array[1] < array[0]) { std::swap(array[0], array[1]); if (array[2] < array[1]) { std::swap(array[1], array[2]); if (array[3] < array[2]) { std::swap(array[2], array[3]); if (array[4] < array[3]) { std::swap(array[3], array[4]); } } } } // Insert last element from the back if (array[5] < array[4]) { std::swap(array[4], array[5]); if (array[4] < array[3]) { std::swap(array[3], array[4]); if (array[3] < array[2]) { std::swap(array[2], array[3]); if (array[2] < array[1]) { std::swap(array[1], array[2]); } } } } } 

My tests on every permutation of 6 values show that this algorithm always performs between 6 and 13 comparisons. I didn't compute the number of swaps performed, but I don't expect it to be higher than 11 in the worst case.

I hope that this helps, even if this question may not represent an actual problem anymore 🙂

EDIT: after putting it into the provided benchmark, it is clearly slower than most of the interesting alternatives. It tends to perform a bit better than the unrolled insertion sort, but that's pretty much it. Basically, it isn't the best sort for integers but could be interesting for types with an expensive comparison operation.

I know I'm super-late, but I was interested in experimenting with some different solutions. First, I cleaned up that paste, made it compile, and put it into a repository. I kept some undesirable solutions as dead-ends so that others wouldn't try them. Among these was my first solution, which attempted to ensure that x1>x2 was calculated only once. After optimization, it is no faster than the other, simple versions.

I added a looping version of rank order sort, since my own application of this study is for sorting 2-8 items, so since there are a variable number of arguments, a loop is necessary. This is also why I ignored the sorting network solutions.

The test code didn't test that duplicates were handled correctly, so while the existing solutions were all correct, I added a special case to the test code to ensure that duplicates were handled correctly.

Then, I wrote an insertion sort that is entirely in AVX registers. On my machine it is 25% faster than the other insertion sorts, but 100% slower than rank order. I did this purely for experiment and had no expectation of this being better due to the branching in insertion sort.

 static inline void sort6_insertion_sort_avx(int* d) { __m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], 0, 0); __m256i index = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7); __m256i shlpermute = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6); __m256i sorted = _mm256_setr_epi32(d[0], INT_MAX, INT_MAX, INT_MAX, INT_MAX, INT_MAX, INT_MAX, INT_MAX); __m256i val, gt, permute; unsigned j; // 8 / 32 = 2^-2 #define ITER(I) \ val = _mm256_permutevar8x32_epi32(src, _mm256_set1_epi32(I));\ gt = _mm256_cmpgt_epi32(sorted, val);\ permute = _mm256_blendv_epi8(index, shlpermute, gt);\ j = ffs( _mm256_movemask_epi8(gt)) >> 2;\ sorted = _mm256_blendv_epi8(_mm256_permutevar8x32_epi32(sorted, permute),\ val, _mm256_cmpeq_epi32(index, _mm256_set1_epi32(j))) ITER(1); ITER(2); ITER(3); ITER(4); ITER(5); int x[8]; _mm256_storeu_si256((__m256i*)x, sorted); d[0] = x[0]; d[1] = x[1]; d[2] = x[2]; d[3] = x[3]; d[4] = x[4]; d[5] = x[5]; #undef ITER } 

Then, I wrote a rank order sort using AVX. This matches the speed of the other rank-order solutions, but is no faster. The issue here is that I can only calculate the indices with AVX, and then I have to make a table of indices. This is because the calculation is destination-based rather than source-based. See Converting from Source-based Indices to Destination-based Indices

 static inline void sort6_rank_order_avx(int* d) { __m256i ror = _mm256_setr_epi32(5, 0, 1, 2, 3, 4, 6, 7); __m256i one = _mm256_set1_epi32(1); __m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], INT_MAX, INT_MAX); __m256i rot = src; __m256i index = _mm256_setzero_si256(); __m256i gt, permute; __m256i shl = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 6, 6); __m256i dstIx = _mm256_setr_epi32(0,1,2,3,4,5,6,7); __m256i srcIx = dstIx; __m256i eq = one; __m256i rotIx = _mm256_setzero_si256(); #define INC(I)\ rot = _mm256_permutevar8x32_epi32(rot, ror);\ gt = _mm256_cmpgt_epi32(src, rot);\ index = _mm256_add_epi32(index, _mm256_and_si256(gt, one));\ index = _mm256_add_epi32(index, _mm256_and_si256(eq,\ _mm256_cmpeq_epi32(src, rot)));\ eq = _mm256_insert_epi32(eq, 0, I) INC(0); INC(1); INC(2); INC(3); INC(4); int e[6]; e[0] = d[0]; e[1] = d[1]; e[2] = d[2]; e[3] = d[3]; e[4] = d[4]; e[5] = d[5]; int i[8]; _mm256_storeu_si256((__m256i*)i, index); d[i[0]] = e[0]; d[i[1]] = e[1]; d[i[2]] = e[2]; d[i[3]] = e[3]; d[i[4]] = e[4]; d[i[5]] = e[5]; } 

The repo can be found here: https://github.com/eyepatchParrot/sort6/

I believe there are two parts to your question.

  • The first is to determine the optimal algorithm. This is done – at least in this case – by looping through every possible ordering (there aren't that many), which allows you to compute the exact min, max, average and standard deviation of compares and swaps (a sketch of such an exhaustive driver follows after this list). Have a runner-up or two handy as well.
  • The second is to optimize the algorithm. A lot can be done to convert textbook code examples to mean and lean real-life algorithms. If you realize that an algorithm can't be optimized to the extent required, try a runner-up.
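As referenced in the first point above, here is a minimal sketch of such an exhaustive driver (the candidate shown is the 12-exchange network from the question; the counter and function names are illustrative, and min/max/standard deviation can be tracked the same way):

 #include <stdio.h>
 #include <string.h>

 static long n_compares, n_swaps;   /* instrumentation counters */

 /* Instrumented compare-exchange used by the candidate under test. */
 static void cswap(int *d, int x, int y)
 {
     n_compares++;
     if (d[y] < d[x]) { int t = d[x]; d[x] = d[y]; d[y] = t; n_swaps++; }
 }

 /* Candidate: the 12-exchange network discussed in the question. */
 static void sort6_candidate(int *d)
 {
     cswap(d, 1, 2); cswap(d, 4, 5); cswap(d, 0, 2); cswap(d, 3, 5);
     cswap(d, 0, 1); cswap(d, 3, 4); cswap(d, 1, 4); cswap(d, 0, 3);
     cswap(d, 2, 5); cswap(d, 1, 3); cswap(d, 2, 4); cswap(d, 2, 3);
 }

 /* Visit all 720 orderings of {0,...,5}, check correctness, gather stats. */
 static void permute(int *a, int k)
 {
     if (k == 6) {
         int d[6];
         memcpy(d, a, sizeof d);
         sort6_candidate(d);
         for (int i = 0; i < 5; i++)
             if (d[i] > d[i + 1]) { printf("FAILED\n"); return; }
         return;
     }
     for (int i = k; i < 6; i++) {
         int t = a[k]; a[k] = a[i]; a[i] = t;
         permute(a, k + 1);
         t = a[k]; a[k] = a[i]; a[i] = t;
     }
 }

 int main(void)
 {
     int a[6] = { 0, 1, 2, 3, 4, 5 };
     permute(a, 0);
     printf("avg compares: %.2f, avg swaps: %.2f\n",
            n_compares / 720.0, n_swaps / 720.0);
     return 0;
 }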

I wouldn't worry too much about emptying pipelines (assuming current x86): branch prediction has come a long way. What I would worry about is making sure that the code and data fit in one cache line each (maybe two for the code). Once there fetch latencies are refreshingly low which will compensate for any stall. It also means that your inner loop will be maybe ten instructions or so which is right where it should be (there are two different inner loops in my sorting algorithm, they are 10 instructions/22 bytes and 9/22 long respectively). Assuming the code doesn't contain any divs you can be sure it will be blindingly fast.

I found that at least on my system, the functions sort6_iterator() and sort6_iterator_local() defined below both ran at least as fast, and frequently noticeably faster, than the above current record holder:

 #define MIN(x, y) (x<y?x:y) #define MAX(x, y) (x<y?y:x) template<class IterType> inline void sort6_iterator(IterType it) { #define SWAP(x,y) { const auto a = MIN(*(it + x), *(it + y)); \ const auto b = MAX(*(it + x), *(it + y)); \ *(it + x) = a; *(it + y) = b; } SWAP(1, 2) SWAP(4, 5) SWAP(0, 2) SWAP(3, 5) SWAP(0, 1) SWAP(3, 4) SWAP(1, 4) SWAP(0, 3) SWAP(2, 5) SWAP(1, 3) SWAP(2, 4) SWAP(2, 3) #undef SWAP } 

I passed this function a std::vector 's iterator in my timing code. I suspect that using iterators gives g++ certain assurances about what can and can't happen to the memory that the iterator refers to, which it otherwise wouldn't have and it is these assurances that allow g++ to better optimize the sorting code (which if I remember correctly, is also part of the reason why so many std algorithms, such as std::sort() , generally have such obscenely good performance). However, while timing I noticed that the context (ie surrounding code) in which the call to the sorting function was made had a significant impact on performance, which is likely due to the function being inlined and then optimized. For instance, if the program was sufficiently simple then there usually wasn't much of a difference in performance between passing the sorting function a pointer versus passing it an iterator; otherwise using iterators usually resulted in noticeably better performance and never (in my experience so far at least) any noticeably worse performance. I suspect that this may be because g++ can globally optimize sufficiently simple code.

Moreover, sort6_iterator() is sometimes (again, depending on the context in which the function is called) consistently outperformed by the following sorting function:

 template<class IterType> inline void sort6_iterator_local(IterType it) { #define SWAP(x,y) { const auto a = MIN(data##x, data##y); \ const auto b = MAX(data##x, data##y); \ data##x = a; data##y = b; } //DD = Define Data #define DD1(a) auto data##a = *(it + a); #define DD2(a,b) auto data##a = *(it + a), data##b = *(it + b); //CB = Copy Back #define CB(a) *(it + a) = data##a; DD2(1,2) SWAP(1, 2) DD2(4,5) SWAP(4, 5) DD1(0) SWAP(0, 2) DD1(3) SWAP(3, 5) SWAP(0, 1) SWAP(3, 4) SWAP(1, 4) SWAP(0, 3) CB(0) SWAP(2, 5) CB(5) SWAP(1, 3) CB(1) SWAP(2, 4) CB(4) SWAP(2, 3) CB(2) CB(3) #undef CB #undef DD2 #undef DD1 #undef SWAP } 

Note that defining SWAP() as follows sometimes results in slightly better performance, although most of the time it results in slightly worse performance or a negligible difference in performance.

 #define SWAP(x,y) { const auto a = MIN(data##x, data##y); \ data##y = MAX(data##x, data##y); \ data##x = a; } 

If you just want a sorting algorithm that gcc -O3 is consistently good at optimizing then, depending on how you pass the input, try one of the following two algorithms:

 template<class T> inline void sort6(T it) { #define SORT2(x,y) {if(data##x>data##y){auto a=std::move(data##y);data##y=std::move(data##x);data##x=std::move(a);}} #define DD1(a) register auto data##a=*(it+a); #define DD2(a,b) register auto data##a=*(it+a);register auto data##b=*(it+b); #define CB1(a) *(it+a)=data##a; #define CB2(a,b) *(it+a)=data##a;*(it+b)=data##b; DD2(1,2) SORT2(1,2) DD2(4,5) SORT2(4,5) DD1(0) SORT2(0,2) DD1(3) SORT2(3,5) SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5) SORT2(1,4) SORT2(0,3) CB1(0) SORT2(2,4) CB1(4) SORT2(1,3) CB1(1) SORT2(2,3) CB2(2,3) #undef CB1 #undef CB2 #undef DD1 #undef DD2 #undef SORT2 } 

or

 template<class T> inline void sort6(T& e0, T& e1, T& e2, T& e3, T& e4, T& e5) { #define SORT2(x,y) {if(data##x>data##y)std::swap(data##x,data##y);} #define DD1(a) register auto data##a=e##a; #define DD2(a,b) register auto data##a=e##a;register auto data##b=e##b; #define CB1(a) e##a=data##a; #define CB2(a,b) e##a=data##a;e##b=data##b; DD2(1,2) SORT2(1,2) DD2(4,5) SORT2(4,5) DD1(0) SORT2(0,2) DD1(3) SORT2(3,5) SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5) SORT2(1,4) SORT2(0,3) CB1(0) SORT2(2,4) CB1(4) SORT2(1,3) CB1(1) SORT2(2,3) CB2(2,3) #undef CB1 #undef CB2 #undef DD1 #undef DD2 #undef SORT2 } 

The reason for using the register keyword is that this is one of the few cases where you know that you want these values in registers. Without register, the compiler will figure this out most of the time, but sometimes it doesn't. Using the register keyword helps solve this issue. Normally, however, don't use the register keyword, since it's more likely to slow your code down than speed it up.

Also, note the use of templates. This is done on purpose since, even with the inline keyword, template functions are generally much more aggressively optimized by gcc than vanilla C functions (this has to do with gcc needing to deal with function pointers for vanilla C functions but not with template functions).

I know this is an old question.

But I just wrote a different kind of solution I want to share,
using nothing but nested MIN and MAX.

It's not fast, as it uses 114 of each;
it could be reduced to 75 pretty simply like so -> pastebin

But then it's not purely min/max anymore.

What might work is doing min/max on multiple integers at once with AVX

PMINSW reference

 #include <stdio.h> static __inline__ int MIN(int a, int b){ int result =a; __asm__ ("pminsw %1, %0" : "+x" (result) : "x" (b)); return result; } static __inline__ int MAX(int a, int b){ int result = a; __asm__ ("pmaxsw %1, %0" : "+x" (result) : "x" (b)); return result; } static __inline__ unsigned long long rdtsc(void){ unsigned long long int x; __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x)); return x; } #define MIN3(a, b, c) (MIN(MIN(a,b),c)) #define MIN4(a, b, c, d) (MIN(MIN(a,b),MIN(c,d))) static __inline__ void sort6(int * in) { const int A=in[0], B=in[1], C=in[2], D=in[3], E=in[4], F=in[5]; in[0] = MIN( MIN4(A,B,C,D),MIN(E,F) ); const int AB = MAX(A, B), AC = MAX(A, C), AD = MAX(A, D), AE = MAX(A, E), AF = MAX(A, F), BC = MAX(B, C), BD = MAX(B, D), BE = MAX(B, E), BF = MAX(B, F), CD = MAX(C, D), CE = MAX(C, E), CF = MAX(C, F), DE = MAX(D, E), DF = MAX(D, F), EF = MAX(E, F); in[1] = MIN4 ( MIN4( AB, AC, AD, AE ), MIN4( AF, BC, BD, BE ), MIN4( BF, CD, CE, CF ), MIN3( DE, DF, EF) ); const int ABC = MAX(AB,C), ABD = MAX(AB,D), ABE = MAX(AB,E), ABF = MAX(AB,F), ACD = MAX(AC,D), ACE = MAX(AC,E), ACF = MAX(AC,F), ADE = MAX(AD,E), ADF = MAX(AD,F), AEF = MAX(AE,F), BCD = MAX(BC,D), BCE = MAX(BC,E), BCF = MAX(BC,F), BDE = MAX(BD,E), BDF = MAX(BD,F), BEF = MAX(BE,F), CDE = MAX(CD,E), CDF = MAX(CD,F), CEF = MAX(CE,F), DEF = MAX(DE,F); in[2] = MIN( MIN4 ( MIN4( ABC, ABD, ABE, ABF ), MIN4( ACD, ACE, ACF, ADE ), MIN4( ADF, AEF, BCD, BCE ), MIN4( BCF, BDE, BDF, BEF )), MIN4( CDE, CDF, CEF, DEF ) ); const int ABCD = MAX(ABC,D), ABCE = MAX(ABC,E), ABCF = MAX(ABC,F), ABDE = MAX(ABD,E), ABDF = MAX(ABD,F), ABEF = MAX(ABE,F), ACDE = MAX(ACD,E), ACDF = MAX(ACD,F), ACEF = MAX(ACE,F), ADEF = MAX(ADE,F), BCDE = MAX(BCD,E), BCDF = MAX(BCD,F), BCEF = MAX(BCE,F), BDEF = MAX(BDE,F), CDEF = MAX(CDE,F); in[3] = MIN4 ( MIN4( ABCD, ABCE, ABCF, ABDE ), MIN4( ABDF, ABEF, ACDE, ACDF ), MIN4( ACEF, ADEF, BCDE, BCDF ), MIN3( BCEF, BDEF, CDEF ) ); const int ABCDE= MAX(ABCD,E), ABCDF= MAX(ABCD,F), ABCEF= MAX(ABCE,F), ABDEF= MAX(ABDE,F), ACDEF= MAX(ACDE,F), BCDEF= MAX(BCDE,F); in[4]= MIN ( MIN4( ABCDE, ABCDF, ABCEF, ABDEF ), MIN ( ACDEF, BCDEF ) ); in[5] = MAX(ABCDE,F); } int main(int argc, char ** argv) { int d[6][6] = { {1, 2, 3, 4, 5, 6}, {6, 5, 4, 3, 2, 1}, {100, 2, 300, 4, 500, 6}, {100, 2, 3, 4, 500, 6}, {1, 200, 3, 4, 5, 600}, {1, 1, 2, 1, 2, 1} }; unsigned long long cycles = rdtsc(); for (int i = 0; i < 6; i++) { sort6(d[i]); } cycles = rdtsc() - cycles; printf("Time is %d\n", (unsigned)cycles); for (int i = 0; i < 6; i++) { printf("d%d : %d %d %d %d %d %d\n", i, d[i][0], d[i][1], d[i][2], d[i][3], d[i][4], d[i][5]); } } 

EDIT:
Rank order solution inspired by Rex Kerr's, much faster than the mess above:

 static void sort6(int *o) {
     const int A=o[0], B=o[1], C=o[2], D=o[3], E=o[4], F=o[5];
     const unsigned char
         AB = A>B, AC = A>C, AD = A>D, AE = A>E,
                   BC = B>C, BD = B>D, BE = B>E,
                             CD = C>D, CE = C>E,
                                       DE = D>E,
         a =     AB + AC + AD + AE + (A>F),
         b = 1 - AB + BC + BD + BE + (B>F),
         c = 2 - AC - BC + CD + CE + (C>F),
         d = 3 - AD - BD - CD + DE + (D>F),
         e = 4 - AE - BE - CE - DE + (E>F);
     o[a]=A; o[b]=B; o[c]=C; o[d]=D; o[e]=E;
     o[15-a-b-c-d-e]=F;
 }

Well, if it's only 6 elements and you can leverage parallelism, want to minimize conditional branching, etc., why don't you generate all the combinations and test for order? I would venture that on some architectures it can be pretty fast (as long as you have the memory preallocated).
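Taken literally, the idea might look like the following sketch (sequential version; nothing here comes from the original answer, and in practice the index permutations would be precomputed and the checks run in parallel):

 #include <string.h>

 /* Does applying index permutation p to d give a non-decreasing order? */
 static int is_sorted_under(const int *d, const int *p)
 {
     for (int i = 0; i < 5; i++)
         if (d[p[i]] > d[p[i + 1]])
             return 0;
     return 1;
 }

 /* Enumerate permutations of p[k..5]; stop at the first one that sorts d. */
 static int search(const int *d, int *p, int k, int *out)
 {
     if (k == 6) {
         if (!is_sorted_under(d, p)) return 0;
         for (int i = 0; i < 6; i++) out[i] = d[p[i]];
         return 1;
     }
     for (int i = k; i < 6; i++) {
         int t = p[k]; p[k] = p[i]; p[i] = t;
         if (search(d, p, k + 1, out)) return 1;
         t = p[k]; p[k] = p[i]; p[i] = t;
     }
     return 0;
 }

 static void sort6_bruteforce(int *d)
 {
     int p[6] = { 0, 1, 2, 3, 4, 5 }, out[6];
     if (search(d, p, 0, out))
         memcpy(d, out, sizeof out);
 }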

Here are three typical sorting methods that represent three different classes of Sorting Algorithms:

 Insertion Sort: Θ(n^2)
 Heap Sort:      Θ(n log n)
 Count Sort:     Θ(3n)

But check out Stefan Nilsson's discussion of the fastest sorting algorithm, where he discusses a solution that goes down to O(n log log n)... check out its implementation in C.

This Semi-Linear Sorting algorithm was presented by a paper in 1995:

A. Andersson, T. Hagerup, S. Nilsson, and R. Raman. Sorting in linear time? In Proceedings of the 27th Annual ACM Symposium on the Theory of Computing, pages 427-436, 1995.

Try a 'merging sorted lists' sort. :) It uses two arrays, and it is fast for both small and big arrays.
When concatenating, you only check where to insert; other, bigger values don't need to be compared (cmp = ab>0).
For 4 numbers, you can use the scheme 4-5 cmp (~4.6) or 3-6 cmp (~4.9). Bubble sort uses 6 cmp (6). Lots of cmp operations on big numbers means slower code.
This code uses 5 cmp (not an MSL sort):
 if (cmp(arr[n][i+0],arr[n][i+1])>0) {swap(n,i+0,i+1);}
 if (cmp(arr[n][i+2],arr[n][i+3])>0) {swap(n,i+2,i+3);}
 if (cmp(arr[n][i+0],arr[n][i+2])>0) {swap(n,i+0,i+2);}
 if (cmp(arr[n][i+1],arr[n][i+3])>0) {swap(n,i+1,i+3);}
 if (cmp(arr[n][i+1],arr[n][i+2])>0) {swap(n,i+1,i+2);}

Principle of MSL:

 9 8 7 6 5 4 3 2 1 0
 89 67 45 23 01   ... concat two sorted lists, list length = 1
 6789 2345 01     ... concat two sorted lists, list length = 2
 23456789 01      ... concat two sorted lists, list length = 4
 0123456789       ... concat two sorted lists, list length = 8

js code

 function sortListMerge_2a(cmp) { var step, stepmax, tmp, a,b,c, i,j,k, m,n, cycles; var start = 0; var end = arr_count; //var str = ''; cycles = 0; if (end>3) { stepmax = ((end - start + 1) >> 1) << 1; m = 1; n = 2; for (step=1;step<stepmax;step<<=1) //bounds 1-1, 2-2, 4-4, 8-8... { a = start; while (a<end) { b = a + step; c = a + step + step; b = b<end ? b : end; c = c<end ? c : end; i = a; j = b; k = i; while (i<b && j<c) { if (cmp(arr[m][i],arr[m][j])>0) {arr[n][k] = arr[m][j]; j++; k++;} else {arr[n][k] = arr[m][i]; i++; k++;} } while (i<b) {arr[n][k] = arr[m][i]; i++; k++; } while (j<c) {arr[n][k] = arr[m][j]; j++; k++; } a = c; } tmp = m; m = n; n = tmp; } return m; } else { // sort 3 items sort10(cmp); return m; } } 

Sorting 4 items with usage of cmp==0: the number of cmp operations is ~4.34 (FF native has ~4.52), but it takes 3x more time than merging lists. Still, fewer cmp operations are better if you have big numbers or big text to compare. Edit: repaired a bug.

Online test http://mlich.zam.slu.cz/js-sort/x-sort-x2.htm

 function sort4DG(cmp,start,end,n) // sort 4 { var n = typeof(n) !=='undefined' ? n : 1; var cmp = typeof(cmp) !=='undefined' ? cmp : sortCompare2; var start = typeof(start)!=='undefined' ? start : 0; var end = typeof(end) !=='undefined' ? end : arr[n].length; var count = end - start; var pos = -1; var i = start; var cc = []; // stabilni? cc[01] = cmp(arr[n][i+0],arr[n][i+1]); cc[23] = cmp(arr[n][i+2],arr[n][i+3]); if (cc[01]>0) {swap(n,i+0,i+1);} if (cc[23]>0) {swap(n,i+2,i+3);} cc[12] = cmp(arr[n][i+1],arr[n][i+2]); if (!(cc[12]>0)) {return n;} cc[02] = cc[01]==0 ? cc[12] : cmp(arr[n][i+0],arr[n][i+2]); if (cc[02]>0) { swap(n,i+1,i+2); swap(n,i+0,i+1); // bubble last to top cc[13] = cc[23]==0 ? cc[12] : cmp(arr[n][i+1],arr[n][i+3]); if (cc[13]>0) { swap(n,i+2,i+3); swap(n,i+1,i+2); // bubble return n; } else { cc[23] = cc[23]==0 ? cc[12] : (cc[01]==0 ? cc[30] : cmp(arr[n][i+2],arr[n][i+3])); // new cc23 | c03 //repaired if (cc[23]>0) { swap(n,i+2,i+3); return n; } return n; } } else { if (cc[12]>0) { swap(n,i+1,i+2); cc[23] = cc[23]==0 ? cc[12] : cmp(arr[n][i+2],arr[n][i+3]); // new cc23 if (cc[23]>0) { swap(n,i+2,i+3); return n; } return n; } else { return n; } } return n; }