Fast method to copy memory with translation - ARGB to BGR

Overview

I have an image buffer that I need to convert to another format. The origin image buffer is four channels, 8 bits per channel: Alpha, Red, Green, and Blue. The destination buffer is three channels, 8 bits per channel: Blue, Green, and Red.

So the brute-force method is:

    // Assume a 32 x 32 pixel image
    #define IMAGESIZE (32*32)

    typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
    typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

    ARGB orig[IMAGESIZE];
    BGR  dest[IMAGESIZE];

    for(int x = 0; x < IMAGESIZE; x++)
    {
        dest[x].Red = orig[x].Red;
        dest[x].Green = orig[x].Green;
        dest[x].Blue = orig[x].Blue;
    }

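However, I need more speed than a loop and three byte copies provide. I'm hoping there are a few tricks I can use to reduce the number of memory reads and writes, given that I'm running on a 32-bit machine.

Additional info

Every image is a multiple of at least 4 pixels. So we could address 16 ARGB bytes and move them into 12 BGR bytes per loop. Perhaps this fact can be used to speed things up, especially as it falls nicely into 32-bit boundaries.

I have access to OpenCL. While that requires moving the entire buffer into GPU memory and then moving the result back out, the fact that OpenCL can work on many portions of the image simultaneously, and the fact that large memory block moves are actually quite efficient, may make this a worthwhile exploration.

While I've given the example of small buffers above, I really am moving HD video (1920x1080) and sometimes larger, mostly smaller, buffers around. So while a 32x32 situation may be trivial, copying 8.3MB of image data byte by byte is really, really bad.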
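To put numbers on that, here is a small sketch of the size arithmetic. The helper name frame_bytes is made up for this illustration and is not part of the question.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a width x height frame at the given bytes per pixel.
   frame_bytes is a hypothetical helper, used only for this illustration. */
static size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return width * height * bytes_per_pixel;
}
```

For a 1920x1080 frame, frame_bytes(1920, 1080, 4) gives 8,294,400 bytes of ARGB input (the 8.3MB mentioned above) and frame_bytes(1920, 1080, 3) gives 6,220,800 bytes of BGR output.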

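Running on Intel processors (Core 2 and above), so I'm aware that streaming and data-processing instructions exist, but I don't know much about them. Perhaps pointers on where to look for specialized data-handling instructions would be good.

This is going into an OS X application, and I'm using XCode 4. If assembly is painless and the obvious way to go, I'm fine traveling down that path, but not having done it on this setup before makes me wary of sinking too much time into it.

Pseudo-code is fine. I'm not looking for a complete solution, just the algorithm and an explanation of any trickery that might not be immediately clear.

I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32MB of random data, and found the averages.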
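A timing loop along those lines could look roughly like the sketch below. It is not the harness that produced the numbers in this answer. The names convert_fn, convert_bytes and average_ms are assumptions made for the example, and clock() measures CPU time rather than wall-clock time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical signature shared by the converters being compared. */
typedef void (*convert_fn)(const uint8_t *argb, uint8_t *bgr, size_t pixels);

/* Plain byte-by-byte converter, used here only as something to time. */
static void convert_bytes(const uint8_t *argb, uint8_t *bgr, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        bgr[3 * i + 0] = argb[4 * i + 3]; /* Blue  */
        bgr[3 * i + 1] = argb[4 * i + 2]; /* Green */
        bgr[3 * i + 2] = argb[4 * i + 1]; /* Red   */
    }
}

/* Mean CPU time in milliseconds over `runs` runs, or -1.0 if allocation fails. */
static double average_ms(convert_fn fn, size_t pixels, int runs)
{
    assert(runs > 0);
    uint8_t *src = malloc(pixels * 4);
    uint8_t *dst = malloc(pixels * 3 + 4); /* 4 spare bytes for converters that overrun */
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    double total = 0.0;
    for (int r = 0; r < runs; r++) {
        for (size_t i = 0; i < pixels * 4; i++)
            src[i] = (uint8_t)rand();      /* fresh random data for every run */
        clock_t t0 = clock();
        fn(src, dst, pixels);
        clock_t t1 = clock();
        total += 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
    }
    free(src);
    free(dst);
    return total / runs;
}
```

A call such as average_ms(convert_bytes, 8u * 1024 * 1024, 10) would time ten runs over 32MB of ARGB input.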

The first version uses a C loop to convert each pixel separately, using the OSSwapInt32 function (which compiles to a bswap instruction with -O3).

    void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
        unsigned x;
        for(x = 0; x < imageSize; x++) {
            *((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
        }
    }

The second method performs the same operation, but uses an inline assembly loop instead of a C loop.

    void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
        asm (
            "0:\n\t"
            "movl   (%1),%%eax\n\t"
            "bswapl %%eax\n\t"
            "movl   %%eax,(%0)\n\t"
            "addl   $4,%1\n\t"
            "addl   $3,%0\n\t"
            "decl   %2\n\t"
            "jnz    0b"
            :: "D" (dest), "S" (orig), "c" (imageSize)
            : "flags", "eax"
        );
    }

The third version is a modified version of a poseur's answer. I converted the built-in functions to the GCC equivalents and used the lddqu built-in function so that the input argument doesn't need to be aligned.

    typedef uint8_t v16qi __attribute__ ((vector_size (16)));
    void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
        v16qi mask = __builtin_ia32_lddqu((const char[]){3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF});
        uint8_t *end = orig + imagesize * 4;
        for (; orig != end; orig += 16, dest += 12) {
            __builtin_ia32_storedqu(dest,__builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig),mask));
        }
    }

Finally, the fourth version is the inline assembly equivalent of the third.

    void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
        int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};//{0xFF, 0xFF, 0xFF, 0xFF, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3};
        asm (
            "lddqu  (%3),%%xmm1\n\t"
            "0:\n\t"
            "lddqu  (%1),%%xmm0\n\t"
            "pshufb %%xmm1,%%xmm0\n\t"
            "movdqu %%xmm0,(%0)\n\t"
            "add    $16,%1\n\t"
            "add    $12,%0\n\t"
            "sub    $4,%2\n\t"
            "jnz    0b"
            :: "r" (dest), "r" (orig), "r" (imagesize), "r" (mask)
            : "flags", "xmm0", "xmm1"
        );
    }

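On my 2010 MacBook Pro, 2.4 GHz i5, 4GB RAM, these were the average times for each:

Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3: 9.3163 milliseconds
Version 4: 9.3584 milliseconds

As you can see, the compiler is good enough at optimization that you don't need to write assembly. Also, the vector functions were only about 1.5 milliseconds faster on 32MB of data, so it won't cause much harm if you want to support the earliest Intel Macs, which didn't support SSSE3.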

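Edit: liori asked for standard deviation information. Unfortunately, I hadn't saved the data points, so I ran another test with 25 iterations.

                  Average     | Standard Deviation
    Brute force:  18.01956 ms | 1.22980 ms (6.8%)
    Version 1:    11.13120 ms | 0.81076 ms (7.3%)
    Version 2:    11.27092 ms | 0.66209 ms (5.9%)
    Version 3:     9.29184 ms | 0.27851 ms (3.0%)
    Version 4:     9.40948 ms | 0.32702 ms (3.5%)

Also, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32MB data set was randomly generated and run through the four functions. The runtime of each function in microseconds is listed below.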

Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1: 10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2: 10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3: 9036 9619 9341 8970 9453 9758 9043 10114 9243 9027 9163 9176 9168 9122 9514 9049 9161 9086 9064 9604 9178 9233 9301 9717 9156
Version 4: 9339 10119 9846 9217 9526 9182 9145 10286 9051 9614 9249 9653 9799 9270 9173 9103 9132 9550 9147 9157 9199 9113 9699 9354 9314
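For anyone who wants to re-derive the table from the raw data, a minimal sketch follows. The function names are made up for the example, and the variance divides by n (population variance), which is what appears to reproduce the reported figures for Version 3. The square root is left to the caller so the sketch does not depend on libm.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n samples. */
static double mean_of(const double *x, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Population variance: divides by n, not n - 1. */
static double population_variance(const double *x, size_t n)
{
    double m = mean_of(x, n);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (x[i] - m) * (x[i] - m);
    return acc / (double)n;
}

/* Version 3 timings in microseconds, copied from the raw data above. */
static const double version3_us[25] = {
    9036, 9619, 9341, 8970, 9453, 9758, 9043, 10114, 9243, 9027,
    9163, 9176, 9168, 9122, 9514, 9049, 9161, 9086, 9064, 9604,
    9178, 9233, 9301, 9717, 9156
};
```

mean_of(version3_us, 25) comes out at 9291.84 microseconds, and the square root of population_variance(version3_us, 25) is roughly 278.5 microseconds, in line with the 9.29184 ms and 0.27851 ms shown for Version 3.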

The obvious, using pshufb.

    #include <assert.h>
    #include <inttypes.h>
    #include <tmmintrin.h>

    // needs:
    // orig is 16-byte aligned
    // imagesize is a multiple of 4
    // dest has 4 trailing scratch bytes
    void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
        assert((uintptr_t)orig % 16 == 0);
        assert(imagesize % 4 == 0);
        __m128i mask = _mm_set_epi8(-128, -128, -128, -128, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3);
        uint8_t *end = orig + imagesize * 4;
        for (; orig != end; orig += 16, dest += 12) {
            _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), mask));
        }
    }
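If the mask looks opaque, a scalar model of what pshufb does with it may help. This is only an illustration of the instruction's semantics, not a replacement for the intrinsic. The names pshufb_model, argb_to_bgr_mask and convert4_model are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PSHUFB: out[i] is 0 when bit 7 of mask[i] is set,
   otherwise it is in[mask[i] & 15]. */
static void pshufb_model(const uint8_t in[16], const uint8_t mask[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (mask[i] & 0x80) ? 0 : in[mask[i] & 0x0F];
}

/* The ARGB -> BGR mask, listed from byte 0 upward.
   _mm_set_epi8 takes the same values from byte 15 downward. */
static const uint8_t argb_to_bgr_mask[16] = {
    3, 2, 1,  7, 6, 5,  11, 10, 9,  15, 14, 13,  0x80, 0x80, 0x80, 0x80
};

/* Turns 4 ARGB pixels (16 bytes) into 4 BGR pixels (12 bytes) plus 4 zero bytes. */
static void convert4_model(const uint8_t argb[16], uint8_t bgr_plus_pad[16])
{
    assert(argb != bgr_plus_pad);   /* the model does not support in-place use */
    pshufb_model(argb, argb_to_bgr_mask, bgr_plus_pad);
}
```

The four zero bytes at the end are why the vector version needs 4 trailing scratch bytes in dest: each 16-byte store writes 12 useful bytes and 4 bytes that the next store overwrites.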

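Combining a poseur's and Jitamaro's answers, if you assume that the inputs and outputs are 16-byte aligned and if you process pixels 4 at a time, you can use a combination of shuffles, masks, ands, and ors to store out using aligned stores. The main idea is to generate four intermediate data sets, then or them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile this or try to run it at all.

EDIT2: More detail about the underlying code structure:

With SSE2, you get better performance with 16-byte aligned reads and writes of 16 bytes. Since a 3-byte pixel only lines up with a 16-byte boundary every 16 pixels, we batch up 16 pixels at a time, using a combination of shuffles, masks, and ors on 16 input pixels at a time.

From LSB to MSB, the inputs look like this, ignoring the specific components:

    s[0]: 0000 0000 0000 0000
    s[1]: 1111 1111 1111 1111
    s[2]: 2222 2222 2222 2222
    s[3]: 3333 3333 3333 3333

and the outputs look like this:

    d[0]: 000 000 000 000 111 1
    d[1]:  11 111 111 222 222 22
    d[2]:   2 222 333 333 333 333

So to generate those outputs, you need to do the following (I will specify the actual transformations later):

    d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
    d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
    d[2]= combine_2(f_2_low(s[2]), f_2_high(s[3]))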

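Now, what should combine_<x> look like? If we assume that d is merely s compacted together, we can concatenate two s's with a mask and an or:

    combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))

where (1 means select the left pixel, 0 means select the right pixel):

    mask(0)= 111 111 111 111 000 0
    mask(1)=  11 111 111 000 000 00
    mask(2)=   1 111 000 000 000 000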
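As a byte-wise illustration of that select, here is a scalar sketch. combine_model and its left_bytes parameter are names made up for the example. The SSE version performs the same operation on a whole 16-byte register at once.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-wise model of combine_x: the first `left_bytes` bytes come from `left`
   and the remaining bytes come from `right`. This is what
   (left & mask) | (right & ~mask) does when the mask covers the low bytes. */
static void combine_model(const uint8_t left[16], const uint8_t right[16],
                          int left_bytes, uint8_t out[16])
{
    assert(left_bytes >= 0 && left_bytes <= 16);
    for (int i = 0; i < 16; i++) {
        uint8_t mask = (i < left_bytes) ? 0xFF : 0x00;
        out[i] = (uint8_t)((left[i] & mask) | (right[i] & (uint8_t)~mask));
    }
}
```

In these terms combine_0 corresponds to left_bytes = 12, combine_1 to left_bytes = 8, and combine_2 to left_bytes = 4, matching mask(0), mask(1) and mask(2).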

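But the actual transformations (f_<x>_low, f_<x>_high) are not that simple. Since we are reversing and removing bytes from the source pixels, the actual transformation is (for the first destination, for brevity):

    d[0]=
        s[0][0].Blue s[0][0].Green s[0][0].Red
        s[0][1].Blue s[0][1].Green s[0][1].Red
        s[0][2].Blue s[0][2].Green s[0][2].Red
        s[0][3].Blue s[0][3].Green s[0][3].Red
        s[1][0].Blue s[1][0].Green s[1][0].Red
        s[1][1].Blue

If you translate the above into byte offsets from source to dest, you get:

    d[0]=
        &s[0]+3 &s[0]+2 &s[0]+1
        &s[0]+7 &s[0]+6 &s[0]+5
        &s[0]+11 &s[0]+10 &s[0]+9
        &s[0]+15 &s[0]+14 &s[0]+13
        &s[1]+3 &s[1]+2 &s[1]+1
        &s[1]+7

(If you take a look at all the s[0] offsets, they match a poseur's shuffle mask, just in reverse order.)

Now, we can generate a shuffle mask to map each source byte to a destination byte (X means we don't care what that value is):

    f_0_low=  3 2 1  7 6 5  11 10 9  15 14 13  X X X X
    f_0_high= X X X X  X X X X  X X X X  3 2 1  7

    f_1_low=  6 5  11 10 9  15 14 13  X X X X  X X X X
    f_1_high= X X X X  X X X X  3 2 1  7 6 5  11 10

    f_2_low=  9  15 14 13  X X X X  X X X X  X X X X
    f_2_high= X X X X  3 2 1  7 6 5  11 10 9  15 14 13

We can further optimize this by looking at the masks we use for each source pixel. If you take a look at the shuffle masks that we use for s[1]:

    f_0_high= X X X X  X X X X  X X X X  3 2 1  7
    f_1_low=  6 5  11 10 9  15 14 13  X X X X  X X X X

Since the two shuffle masks don't overlap, we can combine them and simply mask off the irrelevant pixels in combine_<x>, which we already did! The following code performs all these optimizations (plus it assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.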

EDIT: changed the store to _mm_stream_si128, since you are likely doing a lot of writes and we don't necessarily want to flush the cache. Plus it should be aligned anyway, so you get free perf!

    #include <assert.h>
    #include <inttypes.h>
    #include <tmmintrin.h>

    // needs:
    // orig and dest are 16-byte aligned
    // imagesize is a multiple of 16
    void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
        assert((uintptr_t)orig % 16 == 0);
        assert((uintptr_t)dest % 16 == 0);
        assert(imagesize % 16 == 0);

        __m128i shuf0 = _mm_set_epi8(
            -128, -128, -128, -128, // top 4 bytes are not used
            13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first output

        __m128i shuf1 = _mm_set_epi8(
            7, 1, 2, 3, // top 4 bytes go to the first output
            -128, -128, -128, -128, // unused
            13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to the second output

        __m128i shuf2 = _mm_set_epi8(
            10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to the second output
            -128, -128, -128, -128, // unused
            13, 14, 15, 9); // bottom 4 go to the third output

        __m128i shuf3 = _mm_set_epi8(
            13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to the third output
            -128, -128, -128, -128); // unused

        __m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
        __m128i mask1 = _mm_set_epi32(0,  0, -1, -1);
        __m128i mask2 = _mm_set_epi32(0,  0,  0, -1);

        uint8_t *end = orig + imagesize * 4;
        for (; orig != end; orig += 64, dest += 48) {
            __m128i a = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
            __m128i b = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
            __m128i c = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
            __m128i d = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);

            // _mm_andnot_si128(m, x) computes (~m) & x, so the mask goes first
            _mm_stream_si128((__m128i *)dest,     _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(mask0, b)));
            _mm_stream_si128((__m128i *)dest + 1, _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(mask1, c)));
            _mm_stream_si128((__m128i *)dest + 2, _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(mask2, d)));
        }
    }

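I am coming a little late to the party, and it seems the community has already decided on poseur's pshufb answer, but handing out 2000 reputation is so extremely generous that I have to give it a try.

Here's my version without platform-specific intrinsics or machine-specific asm. I have included some cross-platform timing code showing a 4x speedup if you do both the bit-twiddling like me AND activate compiler optimization (register optimization, loop unrolling):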

    #include <stdlib.h>
    #include <stdio.h>
    #include <time.h>

    #define UInt8 unsigned char

    #define IMAGESIZE (1920*1080)

    int main() {
        time_t t0, t1;
        int frames;
        int frame;
        typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
        typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

        ARGB* orig = malloc(IMAGESIZE*sizeof(ARGB));
        if(!orig) {printf("nomem1"); return 1;}
        BGR* dest = malloc(IMAGESIZE*sizeof(BGR));
        if(!dest) {printf("nomem2"); return 1;}

        printf("to start original hit a key\n");
        getchar();
        t0 = time(0);
        frames = 1200;
        for(frame = 0; frame<frames; frame++) {
            int x;
            for(x = 0; x < IMAGESIZE; x++) {
                dest[x].Red = orig[x].Red;
                dest[x].Green = orig[x].Green;
                dest[x].Blue = orig[x].Blue;
            }
        }
        t1 = time(0);
        printf("finished original of %d frames in %u seconds\n", frames, (unsigned)(t1-t0));

        // on my core 2 subnotebook the original took 16 sec
        // (8 sec with compiler optimization -O3) so at 60 FPS
        // (instead of the 1200) this would be faster than realtime
        // (if you disregard any other rendering you have to do).
        // However if you either want to do other/more processing
        // OR want faster than realtime processing for eg a video-conversion
        // program then this would have to be a lot faster still.

        printf("to start alternative hit a key\n");
        getchar();
        t0 = time(0);
        frames = 1200;
        unsigned int* reader;
        unsigned int* end;
        unsigned int* writer;
        unsigned int p0, p1, p2, p3; // your question guarantees 32 bit cpu
        for(frame = 0; frame<frames; frame++) {
            reader = (void*)orig;
            end = reader+IMAGESIZE;
            writer = (void*)dest;
            while(reader<end) {
                // 4 ARGB pixels in, 3 words (4 BGR pixels) out; on a
                // little-endian cpu a pixel stored as A,R,G,B loads as 0xBBGGRRAA
                p0 = reader[0]; p1 = reader[1]; p2 = reader[2]; p3 = reader[3];
                // bytes B0 G0 R0 B1
                writer[0] = (p0>>24) | ((p0>>8)&0xFF00) | ((p0<<8)&0xFF0000) | (p1&0xFF000000);
                // bytes G1 R1 B2 G2
                writer[1] = ((p1>>16)&0xFF) | (p1&0xFF00) | ((p2>>8)&0xFF0000) | ((p2<<8)&0xFF000000);
                // bytes R2 B3 G3 R3
                writer[2] = ((p2>>8)&0xFF) | ((p3>>16)&0xFF00) | (p3&0xFF0000) | ((p3<<16)&0xFF000000);
                reader += 4;
                writer += 3;
            }
        }
        t1 = time(0);
        printf("finished alternative of %d frames in %u seconds\n", frames, (unsigned)(t1-t0));

        // on my core 2 subnotebook this alternative took 10 sec
        // (4 sec with compiler optimization -O3)
        return 0;
    }

The results are these (on my core 2 subnotebook):

    F:\>gcc b.c -o b.exe

    F:\>b
    to start original hit a key
    finished original of 1200 frames in 16 seconds
    to start alternative hit a key
    finished alternative of 1200 frames in 10 seconds

    F:\>gcc b.c -O3 -o b.exe

    F:\>b
    to start original hit a key
    finished original of 1200 frames in 8 seconds
    to start alternative hit a key
    finished alternative of 1200 frames in 4 seconds

You want to use a Duff's device: http://en.wikipedia.org/wiki/Duff%27s_device. It also works in JavaScript. This post, however, is a bit funny to read: http://lkml.indiana.edu/hypermail/linux/kernel/0008.2/0171.html. Imagine a Duff's device with 512 Kbytes of moves.

In combination with one of the fast conversion functions here, given access to Core 2s, it might be wise to split the translation into threads, each working on its own, say, quarter of the data, as in this pseudocode:

    void bulk_bgrFromArgb(byte[] dest, byte[] src, int n)
    {
        thread threads[] = {
            create_thread(bgrFromArgb, dest, src, n/4),
            create_thread(bgrFromArgb, dest+n/4, src+n/4, n/4),
            create_thread(bgrFromArgb, dest+n/2, src+n/2, n/4),
            create_thread(bgrFromArgb, dest+3*n/4, src+3*n/4, n/4),
        }
        join_threads(threads);
    }
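A concrete version of that pseudocode might look like the sketch below. It assumes POSIX threads are available, uses a plain byte loop as a stand-in for whichever fast converter you pick, and all names (slice, worker, bulk_bgr_from_argb) are invented for the example. Note that the source advances 4 bytes per pixel and the destination 3, so the two pointers need different byte offsets.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_THREADS = 4 };

struct slice {
    const uint8_t *src;   /* ARGB input, 4 bytes per pixel */
    uint8_t *dest;        /* BGR output, 3 bytes per pixel */
    size_t pixels;
};

/* Stand-in for any of the fast converters; a plain byte copy here. */
static void bgr_from_argb(uint8_t *dest, const uint8_t *src, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        dest[3 * i + 0] = src[4 * i + 3];
        dest[3 * i + 1] = src[4 * i + 2];
        dest[3 * i + 2] = src[4 * i + 1];
    }
}

static void *worker(void *arg)
{
    const struct slice *s = arg;
    bgr_from_argb(s->dest, s->src, s->pixels);
    return NULL;
}

/* Returns 0 on success, or the pthread_create error code. */
static int bulk_bgr_from_argb(uint8_t *dest, const uint8_t *src, size_t pixels)
{
    assert(dest != NULL && src != NULL);
    pthread_t threads[NUM_THREADS];
    struct slice slices[NUM_THREADS];
    size_t chunk = pixels / NUM_THREADS;
    int started = 0, err = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        size_t first = (size_t)t * chunk;
        slices[t].src = src + first * 4;
        slices[t].dest = dest + first * 3;
        /* the last thread also takes any leftover pixels */
        slices[t].pixels = (t == NUM_THREADS - 1) ? pixels - first : chunk;
        err = pthread_create(&threads[t], NULL, worker, &slices[t]);
        if (err != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(threads[t], NULL);
    return err;
}
```

Whether four threads actually help depends on memory bandwidth, since the conversion does very little work per byte moved.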

This assembly function should do it. However, I don't know whether you want to keep the old data or not, because this function overwrites it.

The code is for MinGW GCC with Intel assembly flavour; you will have to modify it to suit your compiler/assembler.

    extern "C" {
        int convertARGBtoBGR(uint buffer, uint size);
        __asm(
            ".globl _convertARGBtoBGR\n"
            "_convertARGBtoBGR:\n"
            "  push ebp\n"
            "  mov ebp, esp\n"
            "  sub esp, 4\n"
            "  mov esi, [ebp + 8]\n"
            "  mov edi, esi\n"
            "  mov ecx, [ebp + 12]\n"
            "  cld\n"
            "  convertARGBtoBGR_loop:\n"
            "    lodsd\n"        // load value from [esi] (4 bytes) into eax, increment esi by 4
            "    bswap eax\n"    // swap eax (ARGB) to (BGRA)
            "    stosd\n"        // store 4 bytes to [edi], increment edi by 4
            "    sub edi, 1\n"   // move edi back by 1, so the next write goes over the A byte
            "    loop convertARGBtoBGR_loop\n"
            "  leave\n"
            "  ret\n"
        );
    }

You should call it like so:

    convertARGBtoBGR( &buffer, IMAGESIZE );

This function accesses memory only twice per pixel/packet (1 read, 1 write), compared to your brute-force method, which had (at least, assuming it was compiled to use registers) 3 read and 3 write operations. The method is the same, but the implementation makes it more efficient.

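You can do it in chunks of 4 pixels, moving 32 bits at a time with unsigned long pointers. Just think that from 4 32-bit pixels you can construct, by shifting and OR/AND, 3 words representing 4 24-bit pixels, like this:

    //col0 col1 col2 col3
    //ARGB ARGB ARGB ARGB 32 bits reading (4 pixels)
    //BGRB GRBG RBGR      32 bits writing (4 pixels)

Shift operations take 1 instruction cycle on all modern 32/64-bit processors (barrel shifter technique), so it's the fastest way of constructing those 3 words for writing. Bitwise AND and OR are also blazing fast.

Like this:

    //assuming we have 4 ARGB1 ... ARGB4 pixels and 3 32 bits words, W1, W2 and W3 to write
    // and *dest is an unsigned long pointer for destination
    // (each pixel is read as the value 0xAARRGGBB; the masks select whole bytes)
    W1 = ((ARGB1 & 0x000000ff) << 24) | ((ARGB1 & 0x0000ff00) << 8) | ((ARGB1 & 0x00ff0000) >> 8) | (ARGB2 & 0x000000ff);
    *dest++ = W1;

and so on with the next pixels in a loop.

You'll need some adjusting for images whose size is not a multiple of 4, but I bet this is the fastest approach of all without using assembler.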
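Spelled out for all three output words, the idea could look like the sketch below. It assumes a little-endian CPU (so a pixel stored as A, R, G, B loads as 0xBBGGRRAA, which changes the shifts compared with the snippet above) and a pixel count that is a multiple of 4. The function name is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Converts `pixels` ARGB pixels to BGR, 4 pixels (16 bytes in, 12 bytes out) per step.
   Assumes a little-endian CPU, so a pixel stored as A,R,G,B loads as 0xBBGGRRAA. */
static void argb_to_bgr_words(const uint8_t *src, uint8_t *dest, size_t pixels)
{
    assert(pixels % 4 == 0);
    for (size_t i = 0; i < pixels; i += 4, src += 16, dest += 12) {
        uint32_t p[4], w[3];
        memcpy(p, src, sizeof p);              /* alignment-safe load of 4 pixels */

        /* w[0] holds the bytes B0 G0 R0 B1 in memory order */
        w[0] = (p[0] >> 24) | ((p[0] >> 8) & 0x0000FF00u)
             | ((p[0] << 8) & 0x00FF0000u) | (p[1] & 0xFF000000u);
        /* w[1] holds G1 R1 B2 G2 */
        w[1] = ((p[1] >> 16) & 0x000000FFu) | (p[1] & 0x0000FF00u)
             | ((p[2] >> 8) & 0x00FF0000u) | ((p[2] << 8) & 0xFF000000u);
        /* w[2] holds R2 B3 G3 R3 */
        w[2] = ((p[2] >> 8) & 0x000000FFu) | ((p[3] >> 16) & 0x0000FF00u)
             | (p[3] & 0x00FF0000u) | ((p[3] << 16) & 0xFF000000u);

        memcpy(dest, w, sizeof w);             /* alignment-safe store of 12 bytes */
    }
}
```

The memcpy calls are there to sidestep alignment and aliasing problems; compilers typically turn them into plain 32-bit loads and stores.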

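And by the way, forget about using structs and indexed access. Those are the SLOWEST ways of all for moving data; just take a look at the disassembly listing of a compiled C++ program and you'll agree with me.

    typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
    typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

Aside from assembly or compiler intrinsics, I might try doing the following, while very carefully verifying the end behavior, as some of it (where unions are concerned) is likely to be compiler implementation dependent:

    union uARGB
    {
        ARGB argb;
        UInt32 x;
    };
    union uBGRA
    {
        struct
        {
            BGR bgr;
            UInt8 Alpha;
        } bgra;
        UInt32 x;
    };

Then for your code kernel, with whatever loop unrolling is appropriate:

    inline void argb2bgr(BGR* pbgr, ARGB* pargb)
    {
        uARGB* puargb = (uARGB*)pargb;
        uBGRA ubgra;
        ubgra.x = __byte_reverse_32(puargb->x);
        *pbgr = ubgra.bgra.bgr;
    }

where __byte_reverse_32() assumes the existence of a compiler intrinsic that reverses the bytes of a 32-bit word.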

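To summarize the underlying approach:

view the ARGB structure as a 32-bit integer
reverse the 32-bit integer
view the reversed 32-bit integer as a (BGR)A structure
let the compiler copy the (BGR) portion of the (BGR)A structure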
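The four steps above can be sketched without unions by using memcpy for the "view as" parts. This version assumes GCC or Clang, where __builtin_bswap32 plays the role of the byte-reversal intrinsic. The function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Converts one ARGB pixel to BGR by reversing its 4 bytes and keeping the first 3. */
static void argb_to_bgr_pixel(const uint8_t argb[4], uint8_t bgr[3])
{
    uint32_t v;
    uint8_t bgra[4];
    memcpy(&v, argb, 4);            /* step 1: view the ARGB bytes as a 32-bit integer */
    v = __builtin_bswap32(v);       /* step 2: reverse the bytes (GCC/Clang builtin) */
    memcpy(bgra, &v, 4);            /* step 3: view the result as B, G, R, A */
    memcpy(bgr, bgra, 3);           /* step 4: copy only the B, G, R part */
}

/* Applies the per-pixel conversion to a whole buffer. */
static void argb_to_bgr_bswap(const uint8_t *src, uint8_t *dest, size_t pixels)
{
    assert(src != NULL && dest != NULL);
    for (size_t i = 0; i < pixels; i++)
        argb_to_bgr_pixel(src + 4 * i, dest + 3 * i);
}
```

Because the integer is loaded and stored through memcpy, the byte reversal comes out the same in memory regardless of the CPU's endianness, and the destination is never written past its last byte.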

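Although you can use some CPU-based tricks, this kind of operation can be done fastest with a GPU.

It seems that you use C/C++, so your alternatives for GPU programming may be (on the Windows platform):

DirectCompute (DirectX 11)
Microsoft Research project Accelerator
CUDA
"Google" GPU programming...

In short, use the GPU for this kind of array operation to get faster calculations. GPUs are designed for it.

I haven't seen anyone show an example of how to do it on the GPU.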

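A while ago I wrote something similar to your problem. I received data from a video4linux2 camera in YUV format and wanted to draw it as gray levels on the screen (just the Y component). I also wanted to draw areas that are too dark in blue and oversaturated regions in red.

I started out with the smooth_opengl3.c example from the freeglut distribution.

The data is copied as YUV into the texture, and then the following GLSL shader programs are applied. I'm sure GLSL code runs on all Macs nowadays, and it will be significantly faster than all the CPU approaches.

Note that I have no experience with how you get the data back. In theory glReadPixels should read the data back, but I never measured its performance.

OpenCL might be the easier approach, but I will only start developing for that when I have a notebook that supports it.

    (defparameter *vertex-shader*
    "void main(){
        gl_Position    = gl_ModelViewProjectionMatrix * gl_Vertex;
        gl_FrontColor  = gl_Color;
        gl_TexCoord[0] = gl_MultiTexCoord0;
    }
    ")

    (progn
     (defparameter *fragment-shader*
    "uniform sampler2D textureImage;
    void main()
    {
      vec4 q=texture2D( textureImage, gl_TexCoord[0].st);
      float v=q.z;
      if(int(gl_FragCoord.x)%2 == 0)
         v=q.x;
      float x=0; // 1./255.;
      v-=.278431;
      v*=1.7;
      if(v>=(1.0-x))
        gl_FragColor = vec4(255,0,0,255);
      else if (v<=x)
        gl_FragColor = vec4(0,0,255,255);
      else
        gl_FragColor = vec4(v,v,v,255);
    }
    "))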
