Using the this pointer causes a strange deoptimization in a hot loop

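I recently came across a strange deoptimization (or rather, a missed optimization opportunity).

Consider this function for efficiently unpacking arrays of 3-bit integers into 8-bit integers. It unpacks 16 integers in each loop iteration: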

    void unpack3bit(uint8_t* target, char* source, int size)
    {
        while(size > 0){
            uint64_t t = *reinterpret_cast<uint64_t*>(source);
            target[0] = t & 0x7;
            target[1] = (t >> 3) & 0x7;
            target[2] = (t >> 6) & 0x7;
            target[3] = (t >> 9) & 0x7;
            target[4] = (t >> 12) & 0x7;
            target[5] = (t >> 15) & 0x7;
            target[6] = (t >> 18) & 0x7;
            target[7] = (t >> 21) & 0x7;
            target[8] = (t >> 24) & 0x7;
            target[9] = (t >> 27) & 0x7;
            target[10] = (t >> 30) & 0x7;
            target[11] = (t >> 33) & 0x7;
            target[12] = (t >> 36) & 0x7;
            target[13] = (t >> 39) & 0x7;
            target[14] = (t >> 42) & 0x7;
            target[15] = (t >> 45) & 0x7;
            source += 6;
            size -= 6;
            target += 16;
        }
    }

Here is the generated assembly for part of the code:

    ...
    367:   48 89 c1        mov    rcx,rax
    36a:   48 c1 e9 09     shr    rcx,0x9
    36e:   83 e1 07        and    ecx,0x7
    371:   48 89 4f 18     mov    QWORD PTR [rdi+0x18],rcx
    375:   48 89 c1        mov    rcx,rax
    378:   48 c1 e9 0c     shr    rcx,0xc
    37c:   83 e1 07        and    ecx,0x7
    37f:   48 89 4f 20     mov    QWORD PTR [rdi+0x20],rcx
    383:   48 89 c1        mov    rcx,rax
    386:   48 c1 e9 0f     shr    rcx,0xf
    38a:   83 e1 07        and    ecx,0x7
    38d:   48 89 4f 28     mov    QWORD PTR [rdi+0x28],rcx
    391:   48 89 c1        mov    rcx,rax
    394:   48 c1 e9 12     shr    rcx,0x12
    398:   83 e1 07        and    ecx,0x7
    39b:   48 89 4f 30     mov    QWORD PTR [rdi+0x30],rcx
    ...

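It looks quite efficient: simply a shift right followed by an and, and then a store to the target buffer. But now look at what happens when I change the function into a method of a struct: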

    struct T{
        uint8_t* target;
        char* source;
        void unpack3bit(int size);
    };

    void T::unpack3bit(int size)
    {
        while(size > 0){
            uint64_t t = *reinterpret_cast<uint64_t*>(source);
            target[0] = t & 0x7;
            target[1] = (t >> 3) & 0x7;
            target[2] = (t >> 6) & 0x7;
            target[3] = (t >> 9) & 0x7;
            target[4] = (t >> 12) & 0x7;
            target[5] = (t >> 15) & 0x7;
            target[6] = (t >> 18) & 0x7;
            target[7] = (t >> 21) & 0x7;
            target[8] = (t >> 24) & 0x7;
            target[9] = (t >> 27) & 0x7;
            target[10] = (t >> 30) & 0x7;
            target[11] = (t >> 33) & 0x7;
            target[12] = (t >> 36) & 0x7;
            target[13] = (t >> 39) & 0x7;
            target[14] = (t >> 42) & 0x7;
            target[15] = (t >> 45) & 0x7;
            source += 6;
            size -= 6;
            target += 16;
        }
    }

I expected the generated assembly to be the same, but it isn't. Here is a part of it:

    ...
    2b3:   48 c1 e9 15     shr    rcx,0x15
    2b7:   83 e1 07        and    ecx,0x7
    2ba:   88 4a 07        mov    BYTE PTR [rdx+0x7],cl
    2bd:   48 89 c1        mov    rcx,rax
    2c0:   48 8b 17        mov    rdx,QWORD PTR [rdi]    // Load, BAD!
    2c3:   48 c1 e9 18     shr    rcx,0x18
    2c7:   83 e1 07        and    ecx,0x7
    2ca:   88 4a 08        mov    BYTE PTR [rdx+0x8],cl
    2cd:   48 89 c1        mov    rcx,rax
    2d0:   48 8b 17        mov    rdx,QWORD PTR [rdi]    // Load, BAD!
    2d3:   48 c1 e9 1b     shr    rcx,0x1b
    2d7:   83 e1 07        and    ecx,0x7
    2da:   88 4a 09        mov    BYTE PTR [rdx+0x9],cl
    2dd:   48 89 c1        mov    rcx,rax
    2e0:   48 8b 17        mov    rdx,QWORD PTR [rdi]    // Load, BAD!
    2e3:   48 c1 e9 1e     shr    rcx,0x1e
    2e7:   83 e1 07        and    ecx,0x7
    2ea:   88 4a 0a        mov    BYTE PTR [rdx+0xa],cl
    2ed:   48 89 c1        mov    rcx,rax
    2f0:   48 8b 17        mov    rdx,QWORD PTR [rdi]    // Load, BAD!
    ...

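As you can see, an additional redundant load from memory (mov rdx,QWORD PTR [rdi]) now appears before each shift. It seems that the target pointer (which is now a member instead of a local variable) always has to be reloaded before storing through it. This slows the code down considerably (around 15% in my measurements).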

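At first I thought that maybe the C++ memory model requires that a member pointer not be kept in a register and be reloaded each time, but that seemed like an awkward choice, since it would make a lot of viable optimizations impossible. So I was very surprised that the compiler did not keep target in a register here.

I tried caching the member pointer in a local variable myself: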

    void T::unpack3bit(int size)
    {
        while(size > 0){
            uint64_t t = *reinterpret_cast<uint64_t*>(source);
            uint8_t* target = this->target; // << ptr cached in local variable
            target[0] = t & 0x7;
            target[1] = (t >> 3) & 0x7;
            target[2] = (t >> 6) & 0x7;
            target[3] = (t >> 9) & 0x7;
            target[4] = (t >> 12) & 0x7;
            target[5] = (t >> 15) & 0x7;
            target[6] = (t >> 18) & 0x7;
            target[7] = (t >> 21) & 0x7;
            target[8] = (t >> 24) & 0x7;
            target[9] = (t >> 27) & 0x7;
            target[10] = (t >> 30) & 0x7;
            target[11] = (t >> 33) & 0x7;
            target[12] = (t >> 36) & 0x7;
            target[13] = (t >> 39) & 0x7;
            target[14] = (t >> 42) & 0x7;
            target[15] = (t >> 45) & 0x7;
            source += 6;
            size -= 6;
            this->target += 16;
        }
    }

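This code also yields the "good" assembly, without the additional loads. So my guess is: the compiler is not allowed to hoist the load of a member pointer of a struct, so such a "hot pointer" should always be kept in a local variable.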

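So, why is the compiler unable to optimize out these loads?
Is it the C++ memory model that forbids this? Or is it simply a shortcoming of my compiler?
Is my guess correct? If not, what is the exact reason the optimization cannot be performed?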

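The compiler in use was g++ 4.8.2-19ubuntu1 with -O3 optimization. I also tried clang++ 3.4-1ubuntu3, with similar results: Clang is even able to vectorize the method when the local target pointer is used. With the this->target pointer, however, the result is the same: an extra load of the pointer before each store.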

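I checked the assembly of some similar methods and the result is the same: it seems that a member of this always has to be reloaded before a store, even if such a load could simply be hoisted out of the loop. I will have to rewrite a lot of code to get rid of these additional loads, mainly by caching the pointer myself in a local variable declared above the hot code. I always thought that fiddling with details such as caching a pointer in a local variable would surely qualify as premature optimization these days, when compilers have become so clever. But it seems I was wrong. Caching a member pointer in a hot loop appears to be a necessary manual optimization technique.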
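The manual fix described above, with the cached pointers declared above the hot loop and written back once afterwards, might look roughly like the sketch below. The struct name Unpacker and the locals in and out are made up for illustration. The sketch also swaps the reinterpret_cast load for a 6-byte std::memcpy and replaces the hand-unrolled stores with a short loop; both are choices of this sketch, not of the original code, and the bit layout assumes a little-endian host.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical variant of the struct from the question.
struct Unpacker {
    uint8_t* target;
    const char* source;
    void unpack3bit(int size);
};

void Unpacker::unpack3bit(int size) {
    assert(size % 6 == 0);  // this sketch only handles whole 6-byte groups

    // Copy the members into locals whose addresses are never taken, so a
    // store through `out` cannot be assumed to modify them.
    uint8_t* out = target;
    const char* in = source;

    while (size > 0) {
        uint64_t t = 0;
        std::memcpy(&t, in, 6);  // 48 packed bits; assumes little-endian
        for (int i = 0; i < 16; ++i) {
            out[i] = static_cast<uint8_t>((t >> (3 * i)) & 0x7);
        }
        in += 6;
        size -= 6;
        out += 16;
    }

    // Write the advanced pointers back to the members once, after the loop.
    target = out;
    source = in;
}
```

Whether a particular compiler then keeps both pointers in registers is best confirmed by inspecting the generated assembly, as was done above.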

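Pointer aliasing seems to be the problem, ironically between this and this->target. The compiler is taking into account the rather perverse possibility that you initialized: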

    this->target = reinterpret_cast<uint8_t*>(this);

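In that case, writing to this->target[0] would alter the contents of this (and thus this->target).

The memory aliasing problem is not restricted to the case above. In principle, any use of this->target[XX], given an (in)appropriate value of XX, might point into this.

I am better versed in C, where this can be remedied by declaring pointer variables with the __restrict__ keyword.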
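To make the scenario concrete, here is a small sketch showing that such a store really can change the pointer member. The struct Holder and the function storeChangesPointer are hypothetical names introduced for this illustration, and the sketch assumes uint8_t is a typedef for unsigned char, as it is on common platforms.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical struct for illustration; not from the original post.
struct Holder {
    uint8_t* target;
};

// Returns true if a store through h.target changed the bytes of h.target itself.
bool storeChangesPointer() {
    Holder h;
    h.target = reinterpret_cast<uint8_t*>(&h);  // target points at its own holder

    // The member sits at offset 0, so target[0] is the first byte of the pointer.
    assert(static_cast<void*>(&h) == static_cast<void*>(&h.target));

    unsigned char before[sizeof(h.target)];
    std::memcpy(before, &h.target, sizeof(before));

    uint8_t* p = h.target;  // stable copy of the pointer for the write
    p[0] ^= 0x1;            // a character-type store may alias any object

    unsigned char after[sizeof(h.target)];
    std::memcpy(after, &h.target, sizeof(after));

    return std::memcmp(before, after, sizeof(before)) != 0;
}
```

Because this program is well defined, the compiler has to generate code for the member function that still behaves correctly in this case, which is why it reloads the pointer after every store.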

Strict aliasing rules allow char* to alias any other pointer. So this->target may alias this, and in your method the first part of the code,

    target[0] = t & 0x7;
    target[1] = (t >> 3) & 0x7;
    target[2] = (t >> 6) & 0x7;

is in fact

    this->target[0] = t & 0x7;
    this->target[1] = (t >> 3) & 0x7;
    this->target[2] = (t >> 6) & 0x7;

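because the object that this points to may itself be modified when you write to the memory that this->target points to.

Once this->target is cached in a local variable, the store can no longer alias that local variable.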

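The issue here is strict aliasing, which says that we are allowed to alias through a char*, and that is what prevents the compiler optimization in your case. We are not allowed to alias through a pointer of a different type, which would be undefined behavior. The more common form of this problem is users attempting to alias through incompatible pointer types.

It would seem reasonable to implement uint8_t as an unsigned char, and if we look at cstdint on Coliru, it includes stdint.h, which typedefs uint8_t as follows: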

    typedef unsigned char uint8_t;

If you used another, non-char type, the compiler should be able to optimize.
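As a rough illustration of that point, the sketch below stores into uint16_t elements instead of uint8_t. The name WideUnpacker is hypothetical, and the 6-byte std::memcpy load (little-endian assumed) and the short inner loop are substitutions made by this sketch. Since a uint16_t lvalue is not allowed to alias the pointer members, the compiler is free to keep target in a register across the stores; whether a given compiler actually does so should be checked in the generated assembly.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical variant: the output element type is uint16_t, a non-char type.
struct WideUnpacker {
    uint16_t* target;
    const char* source;
    void unpack3bit(int size);
};

void WideUnpacker::unpack3bit(int size) {
    assert(size % 6 == 0);  // this sketch only handles whole 6-byte groups
    while (size > 0) {
        uint64_t t = 0;
        std::memcpy(&t, source, 6);  // 48 packed bits; assumes little-endian
        // These stores go through uint16_t lvalues, which may not legally
        // modify the pointer members `target` and `source`.
        for (int i = 0; i < 16; ++i) {
            target[i] = static_cast<uint16_t>((t >> (3 * i)) & 0x7);
        }
        source += 6;
        size -= 6;
        target += 16;
    }
}
```

The trade-off is that the output buffer takes twice as much memory, so this is only attractive when the wider element type is acceptable anyway.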

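This is covered in the draft C++ standard, section 3.10 Lvalues and rvalues, which says:

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined

and includes the following bullet:

a char or unsigned char type.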

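Note that I posted a comment about possible workarounds on a question that asks "When is uint8_t ≠ unsigned char?", and the recommendation was:

The trivial workaround, however, is to use the restrict keyword, or to copy the pointer to a local variable whose address is never taken, so that the compiler does not need to worry about whether the uint8_t objects can alias it.

Since C++ does not support the restrict keyword, you have to rely on a compiler extension; for example, gcc uses __restrict__. So this is not totally portable, but the other suggestion should be.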
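For reference, the extension syntax looks roughly like this. The macro name UNPACK_RESTRICT and the function name unpack3bit_restrict are invented for this sketch; __restrict__ is the GCC and Clang spelling and __restrict the MSVC one. The qualifier is a promise from the programmer that the two buffers do not overlap, and breaking that promise is undefined behavior. As in any portable sketch of this loop, the load uses a 6-byte std::memcpy and assumes a little-endian host.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical portability macro for the non-standard restrict qualifier.
#if defined(_MSC_VER)
#define UNPACK_RESTRICT __restrict
#else
#define UNPACK_RESTRICT __restrict__
#endif

// The caller promises that `target` and `source` refer to non-overlapping memory.
void unpack3bit_restrict(uint8_t* UNPACK_RESTRICT target,
                         const char* UNPACK_RESTRICT source,
                         int size) {
    assert(size % 6 == 0);  // this sketch only handles whole 6-byte groups
    while (size > 0) {
        uint64_t t = 0;
        std::memcpy(&t, source, 6);  // 48 packed bits; assumes little-endian
        for (int i = 0; i < 16; ++i) {
            target[i] = static_cast<uint8_t>((t >> (3 * i)) & 0x7);
        }
        source += 6;
        size -= 6;
        target += 16;
    }
}
```

A member function could pass this->target and this->source to such a helper and advance the members afterwards, which in practice amounts to the same local-copy workaround with an extra no-overlap promise on top.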
