预取示例？

任何人都可以举一个例子或链接到一个在GCC中使用__builtin_prefetch的例子（或者只是一般的asm指令prefetcht0）来获得实质的性能优势吗？特别是，我希望例子符合以下标准：

这是一个简单，小巧，独立的例子。
删除__builtin_prefetch指令会导致性能下降。
将__builtin_prefetch指令replace为相应的内存访问会导致性能下降。

也就是说，我想最短的例子显示__builtin_prefetch执行优化，没有它不能pipe理。

这是我从一个更大的项目中取出的一段实际的代码。（对不起，这是我能find的最短的一个，从预取有一个明显的加速。）这段代码执行一个非常大的数据转置。

本示例使用SSE预取指令，可能与GCC发出的指令相同。

为了运行这个例子，你将需要为x64编译这个，并且有超过4GB的内存。您可以使用较小的数据大小运行它，但速度太快，时间不够。

 #include <iostream> using std::cout; using std::endl; #include <emmintrin.h> #include <malloc.h> #include <time.h> #include <string.h> #define ENABLE_PREFETCH #define f_vector __m128d #define i_ptr size_t inline void swap_block(f_vector *A,f_vector *B,i_ptr L){ // To be super-optimized later. f_vector *stop = A + L; do{ f_vector tmpA = *A; f_vector tmpB = *B; *A++ = tmpB; *B++ = tmpA; }while (A < stop); } void transpose_even(f_vector *T,i_ptr block,i_ptr x){ // Transposes T. // T contains x columns and x rows. // Each unit is of size (block * sizeof(f_vector)) bytes. //Conditions: // - 0 < block // - 1 < x i_ptr row_size = block * x; i_ptr iter_size = row_size + block; // End of entire matrix. f_vector *stop_T = T + row_size * x; f_vector *end = stop_T - row_size; // Iterate each row. f_vector *y_iter = T; do{ // Iterate each column. f_vector *ptr_x = y_iter + block; f_vector *ptr_y = y_iter + row_size; do{ #ifdef ENABLE_PREFETCH _mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0); #endif swap_block(ptr_x,ptr_y,block); ptr_x += block; ptr_y += row_size; }while (ptr_y < stop_T); y_iter += iter_size; }while (y_iter < end); } int main(){ i_ptr dimension = 4096; i_ptr block = 16; i_ptr words = block * dimension * dimension; i_ptr bytes = words * sizeof(f_vector); cout << "bytes = " << bytes << endl; // system("pause"); f_vector *T = (f_vector*)_mm_malloc(bytes,16); if (T == NULL){ cout << "Memory Allocation Failure" << endl; system("pause"); exit(1); } memset(T,0,bytes); // Perform in-place data transpose cout << "Starting Data Transpose... "; clock_t start = clock(); transpose_even(T,block,dimension); clock_t end = clock(); cout << "Done" << endl; cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl; _mm_free(T); system("pause"); }

当我启用ENABLE_PREFETCH运行它，这是输出：

 bytes = 4294967296 Starting Data Transpose... Done Time: 0.725 seconds Press any key to continue . . .

当我运行ENABLE_PREFETCH禁用时，这是输出：

 bytes = 4294967296 Starting Data Transpose... Done Time: 0.822 seconds Press any key to continue . . .

所以预取速度提高了13％。

编辑：

这里有更多的结果：

 Operating System: Windows 7 Professional/Ultimate Compiler: Visual Studio 2010 SP1 Compile Mode: x64 Release Intel Core i7 860 @ 2.8 GHz, 8 GB DDR3 @ 1333 MHz Prefetch : 0.868 No Prefetch: 0.960 Intel Core i7 920 @ 3.5 GHz, 12 GB DDR3 @ 1333 MHz Prefetch : 0.725 No Prefetch: 0.822 Intel Core i7 2600K @ 4.6 GHz, 16 GB DDR3 @ 1333 MHz Prefetch : 0.718 No Prefetch: 0.796 2 x Intel Xeon X5482 @ 3.2 GHz, 64 GB DDR2 @ 800 MHz Prefetch : 2.273 No Prefetch: 2.666

二进制search是一个简单的例子，可以从显式预取中受益。二进制search的访问模式对硬件预取器而言几乎是随机的，所以很难准确地预测要获取的内容。

在这个例子中，我预取了当前迭代中下一个循环迭代的两个可能的“中间”位置。一个预取可能永远不会被使用，但另一个（除非这是最后一次迭代）。

  #include <time.h> #include <stdio.h> #include <stdlib.h> int binarySearch(int *array, int number_of_elements, int key) { int low = 0, high = number_of_elements-1, mid; while(low <= high) { mid = (low + high)/2; #ifdef DO_PREFETCH // low path __builtin_prefetch (&array[(mid + 1 + high)/2], 0, 1); // high path __builtin_prefetch (&array[(low + mid - 1)/2], 0, 1); #endif if(array[mid] < key) low = mid + 1; else if(array[mid] == key) return mid; else if(array[mid] > key) high = mid-1; } return -1; } int main() { int SIZE = 1024*1024*512; int *array = malloc(SIZE*sizeof(int)); for (int i=0;i<SIZE;i++){ array[i] = i; } int NUM_LOOKUPS = 1024*1024*8; srand(time(NULL)); int *lookups = malloc(NUM_LOOKUPS * sizeof(int)); for (int i=0;i<NUM_LOOKUPS;i++){ lookups[i] = rand() % SIZE; } for (int i=0;i<NUM_LOOKUPS;i++){ int result = binarySearch(array, SIZE, lookups[i]); } free(array); free(lookups); }

当我使用DO_PREFETCH来编译和运行这个例子时，运行时间减less了20％

  $ gcc c-binarysearch.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3 $ gcc c-binarysearch.c -o no-prefetch -std=c11 -O3 $ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./with-prefetch Performance counter stats for './with-prefetch': 356,675,702 L1-dcache-load-misses # 41.39% of all L1-dcache hits 861,807,382 L1-dcache-loads 8.787467487 seconds time elapsed $ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./no-prefetch Performance counter stats for './no-prefetch': 382,423,177 L1-dcache-load-misses # 97.36% of all L1-dcache hits 392,799,791 L1-dcache-loads 11.376439030 seconds time elapsed

请注意，我们正在预取版本中执行两倍的L1高速caching加载。我们实际上正在做更多的工作，但内存访问模式对pipe道更加友好。这也显示了权衡。虽然这个代码块的运行速度很快，但是我们在caching中加载了大量的垃圾，这可能会给应用程序的其他部分带来更大的压力。

从文档：

  for (i = 0; i < n; i++) { a[i] = a[i] + b[i]; __builtin_prefetch (&a[i+j], 1, 1); __builtin_prefetch (&b[i+j], 0, 1); /* ... */ }

预取示例？

什么是你见过的最荒谬的悲观化？

与SSE有效的4x4matrix向量乘法：水平添加和点积 – 有什么意义？

为什么是一个简单的循环优化时，极限是959而不是960？

内存分配的时间复杂度

最快的方法来sorting在JavaScript中的32位有符号整数数组？

将Sqrt（x）计算为x * InvSqrt（x）是否在Doom 3 BFG代码中有意义？

为什么妈妈在Haswell上只有3个周期，与Agner的指令表不同？

最快的代码C / C ++来select一组27个浮点值的中位数

在Python中加速位串/位操作？

编译器中的布尔值为8位。对他们的操作是低效的吗？

预取示例？

什么是你见过的最荒谬的悲观化？

与SSE有效的4x4matrix向量乘法：水平添加和点积 – 有什么意义？

为什么是一个简单的循环优化时，极限是959而不是960？

内存分配的时间复杂度

最快的方法来sorting在JavaScript中的32位有符号整数数组？

将Sqrt（x）计算为x * InvSqrt（x）是否在Doom 3 BFG代码中有意义？

为什么妈妈在Haswell上只有3个周期，与Agner的指令表不同？

最快的代码C / C ++来select一组27个浮点值的中位数

在Python中加速位串/位操作？

编译器中的布尔值为8位。 对他们的操作是低效的吗？

编译器中的布尔值为8位。对他们的操作是低效的吗？