Linux上可怜的memcpy性能

我们最近购买了一些新的服务器，并且遇到可怜的memcpy性能。与我们的笔记本电脑相比，memcpy的性能比服务器慢3倍。

服务器规格

底盘和Mobo：超级微型1027GR-TRF
CPU：2x Intel Xeon E5-2680 @ 2.70 Ghz
内存：8x 16GB DDR3 1600MHz

编辑：我也在另一台服务器上testing稍高的规格，并看到与上述服务器相同的结果

服务器2规格

底盘和Mobo：超级微型10227GR-TRFT
CPU：2x Intel Xeon E5-2650 v2 @ 2.6 Ghz
内存：8x 16GB DDR3 1866MHz

笔记本电脑规格

机箱：联想W530
CPU：1x Intel Core i7 i7-3720QM @ 2.6Ghz
内存：4x 4GB DDR3 1600MHz

操作系统

$ cat /etc/redhat-release Scientific Linux release 6.5 (Carbon) $ uname -a Linux r113 2.6.32-431.1.2.el6.x86_64 #1 SMP Thu Dec 12 13:59:19 CST 2013 x86_64 x86_64 x86_64 GNU/Linux

编译器（在所有系统上）

 $ gcc --version gcc (GCC) 4.6.1

还根据@stefan的build议testing了gcc 4.8.2。编译器之间没有性能差异。

testing代码下面的testing代码是一个jar装testing，用于复制我在生产代码中看到的问题。我知道这个基准是简单的，但它能够利用和识别我们的问题。代码在它们之间创build两个1GB缓冲区和memcpys，对memcpy调用进行计时。您可以使用以下命令在命令行上指定备用缓冲区大小：./big_memcpy_test [SIZE_BYTES]

 #include <chrono> #include <cstring> #include <iostream> #include <cstdint> class Timer { public: Timer() : mStart(), mStop() { update(); } void update() { mStart = std::chrono::high_resolution_clock::now(); mStop = mStart; } double elapsedMs() { mStop = std::chrono::high_resolution_clock::now(); std::chrono::milliseconds elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(mStop - mStart); return elapsed_ms.count(); } private: std::chrono::high_resolution_clock::time_point mStart; std::chrono::high_resolution_clock::time_point mStop; }; std::string formatBytes(std::uint64_t bytes) { static const int num_suffix = 5; static const char* suffix[num_suffix] = { "B", "KB", "MB", "GB", "TB" }; double dbl_s_byte = bytes; int i = 0; for (; (int)(bytes / 1024.) > 0 && i < num_suffix; ++i, bytes /= 1024.) { dbl_s_byte = bytes / 1024.0; } const int buf_len = 64; char buf[buf_len]; // use snprintf so there is no buffer overrun int res = snprintf(buf, buf_len,"%0.2f%s", dbl_s_byte, suffix[i]); // snprintf returns number of characters that would have been written if n had // been sufficiently large, not counting the terminating null character. // if an encoding error occurs, a negative number is returned. if (res >= 0) { return std::string(buf); } return std::string(); } void doMemmove(void* pDest, const void* pSource, std::size_t sizeBytes) { memmove(pDest, pSource, sizeBytes); } int main(int argc, char* argv[]) { std::uint64_t SIZE_BYTES = 1073741824; // 1GB if (argc > 1) { SIZE_BYTES = std::stoull(argv[1]); std::cout << "Using buffer size from command line: " << formatBytes(SIZE_BYTES) << std::endl; } else { std::cout << "To specify a custom buffer size: big_memcpy_test [SIZE_BYTES] \n" << "Using built in buffer size: " << formatBytes(SIZE_BYTES) << std::endl; } // big array to use for testing char* p_big_array = NULL; ///////////// // malloc { Timer timer; p_big_array = (char*)malloc(SIZE_BYTES * sizeof(char)); if (p_big_array == NULL) { std::cerr << "ERROR: malloc of " << SIZE_BYTES << " returned NULL!" << std::endl; return 1; } std::cout << "malloc for " << formatBytes(SIZE_BYTES) << " took " << timer.elapsedMs() << "ms" << std::endl; } ///////////// // memset { Timer timer; // set all data in p_big_array to 0 memset(p_big_array, 0xF, SIZE_BYTES * sizeof(char)); double elapsed_ms = timer.elapsedMs(); std::cout << "memset for " << formatBytes(SIZE_BYTES) << " took " << elapsed_ms << "ms " << "(" << formatBytes(SIZE_BYTES / (elapsed_ms / 1.0e3)) << " bytes/sec)" << std::endl; } ///////////// // memcpy { char* p_dest_array = (char*)malloc(SIZE_BYTES); if (p_dest_array == NULL) { std::cerr << "ERROR: malloc of " << SIZE_BYTES << " for memcpy test" << " returned NULL!" << std::endl; return 1; } memset(p_dest_array, 0xF, SIZE_BYTES * sizeof(char)); // time only the memcpy FROM p_big_array TO p_dest_array Timer timer; memcpy(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char)); double elapsed_ms = timer.elapsedMs(); std::cout << "memcpy for " << formatBytes(SIZE_BYTES) << " took " << elapsed_ms << "ms " << "(" << formatBytes(SIZE_BYTES / (elapsed_ms / 1.0e3)) << " bytes/sec)" << std::endl; // cleanup p_dest_array free(p_dest_array); p_dest_array = NULL; } ///////////// // memmove { char* p_dest_array = (char*)malloc(SIZE_BYTES); if (p_dest_array == NULL) { std::cerr << "ERROR: malloc of " << SIZE_BYTES << " for memmove test" << " returned NULL!" << std::endl; return 1; } memset(p_dest_array, 0xF, SIZE_BYTES * sizeof(char)); // time only the memmove FROM p_big_array TO p_dest_array Timer timer; // memmove(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char)); doMemmove(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char)); double elapsed_ms = timer.elapsedMs(); std::cout << "memmove for " << formatBytes(SIZE_BYTES) << " took " << elapsed_ms << "ms " << "(" << formatBytes(SIZE_BYTES / (elapsed_ms / 1.0e3)) << " bytes/sec)" << std::endl; // cleanup p_dest_array free(p_dest_array); p_dest_array = NULL; } // cleanup free(p_big_array); p_big_array = NULL; return 0; }

CMake文件来build立

 project(big_memcpy_test) cmake_minimum_required(VERSION 2.4.0) include_directories(${CMAKE_CURRENT_SOURCE_DIR}) # create verbose makefiles that show each command line as it is issued set( CMAKE_VERBOSE_MAKEFILE ON CACHE BOOL "Verbose" FORCE ) # release mode set( CMAKE_BUILD_TYPE Release ) # grab in CXXFLAGS environment variable and append C++11 and -Wall options set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++0x -Wall -march=native -mtune=native" ) message( INFO "CMAKE_CXX_FLAGS = ${CMAKE_CXX_FLAGS}" ) # sources to build set(big_memcpy_test_SRCS main.cpp ) # create an executable file named "big_memcpy_test" from # the source files in the variable "big_memcpy_test_SRCS". add_executable(big_memcpy_test ${big_memcpy_test_SRCS})

检测结果

 Buffer Size: 1GB | malloc (ms) | memset (ms) | memcpy (ms) | NUMA nodes (numactl --hardware) --------------------------------------------------------------------------------------------- Laptop 1 | 0 | 127 | 113 | 1 Laptop 2 | 0 | 180 | 120 | 1 Server 1 | 0 | 306 | 301 | 2 Server 2 | 0 | 352 | 325 | 2

正如你所看到的，我们服务器上的memcpys和memsets比我们笔记本上的memcpys和memset慢得多。

改变缓冲区大小

我已经尝试了从100MB到5GB的缓冲区，所有这些都有类似的结果（服务器比笔记本电脑慢）

NUMA亲和力

我读了关于NUMA性能问题的人，所以我尝试使用numactl设置CPU和内存的亲和力，但结果保持不变。

服务器NUMA硬件

 $ numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 node 0 size: 65501 MB node 0 free: 62608 MB node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 node 1 size: 65536 MB node 1 free: 63837 MB node distances: node 0 1 0: 10 21 1: 21 10

笔记本电脑NUMA硬件

 $ numactl --hardware available: 1 nodes (0) node 0 cpus: 0 1 2 3 4 5 6 7 node 0 size: 16018 MB node 0 free: 6622 MB node distances: node 0 0: 10

设置NUMA亲和力

 $ numactl --cpunodebind=0 --membind=0 ./big_memcpy_test

任何帮助解决这个不胜感激。

编辑：GCC选项

基于评论我已经尝试编译与不同的GCC选项：

用-march和-mtune编译为本机编译

 g++ -std=c++0x -Wall -march=native -mtune=native -O3 -DNDEBUG -o big_memcpy_test main.cpp

结果：完全相同的performance（没有改进）

用-O2代替-O3编译

 g++ -std=c++0x -Wall -march=native -mtune=native -O2 -DNDEBUG -o big_memcpy_test main.cpp

结果：完全相同的performance（没有改进）

编辑：更改memset写入0xF而不是0，以避免NULL页面（@SteveCox）

使用非0值（在这种情况下使用0xF）进行memsetting时没有任何改进。

编辑：Cachebench结果

为了排除我的testing程序太简单，我下载了一个真正的基准程序LLCacheBench（ http://icl.cs.utk.edu/projects/llcbench/cachebench.html ）

我在每台机器上分别构build了基准，以避免体系结构问题。以下是我的结果。

笔记本电脑与服务器的memcpy性能

请注意，较大的差异是在较大的缓冲区大小上的性能。 testing的最后尺寸（16777216）在笔记本上以18849.29 MB /秒和服务器上的6710.40执行。这大约相差3倍。您还可以注意到，服务器的性能下降比笔记本电脑上的陡峭。

编辑：memmove（）比memcpy（）在服务器上快2倍

基于一些实验，我已经尝试在我的testing用例中使用memmove（）而不是memcpy（），并在服务器上find了2倍的改进。笔记本电脑上的Memmove（）运行速度比memcpy（）慢，但奇怪的是，运行速度与服务器上的memmove（）相同。这引出了问题，为什么memcpy这么慢？

更新了代码，以便与memcpy一起testingmemmove。我不得不把memmove（）包装在一个函数中，因为如果我把它放在内联GCC优化它，并执行完全相同的memcpy（）（我认为gcc优化它为memcpy，因为它知道的位置不重叠）。

更新结果

 Buffer Size: 1GB | malloc (ms) | memset (ms) | memcpy (ms) | memmove() | NUMA nodes (numactl --hardware) --------------------------------------------------------------------------------------------------------- Laptop 1 | 0 | 127 | 113 | 161 | 1 Laptop 2 | 0 | 180 | 120 | 160 | 1 Server 1 | 0 | 306 | 301 | 159 | 2 Server 2 | 0 | 352 | 325 | 159 | 2

编辑：天真Memcpy

根据@Salgar的build议，我已经实现了我自己的朴素memcpy函数并对其进行了testing。

天真Memcpy来源

 void naiveMemcpy(void* pDest, const void* pSource, std::size_t sizeBytes) { char* p_dest = (char*)pDest; const char* p_source = (const char*)pSource; for (std::size_t i = 0; i < sizeBytes; ++i) { *p_dest++ = *p_source++; } }

天真的Memcpy结果与memcpy（）相比

 Buffer Size: 1GB | memcpy (ms) | memmove(ms) | naiveMemcpy() ------------------------------------------------------------ Laptop 1 | 113 | 161 | 160 Server 1 | 301 | 159 | 159 Server 2 | 325 | 159 | 159

编辑：assembly输出

简单的memcpy源码

 #include <cstring> #include <cstdlib> int main(int argc, char* argv[]) { size_t SIZE_BYTES = 1073741824; // 1GB char* p_big_array = (char*)malloc(SIZE_BYTES * sizeof(char)); char* p_dest_array = (char*)malloc(SIZE_BYTES * sizeof(char)); memset(p_big_array, 0xA, SIZE_BYTES * sizeof(char)); memset(p_dest_array, 0xF, SIZE_BYTES * sizeof(char)); memcpy(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char)); free(p_dest_array); free(p_big_array); return 0; }

assembly输出：在服务器和笔记本电脑上完全相同。我正在节省空间，而不是两者都粘贴。

  .file "main_memcpy.cpp" .section .text.startup,"ax",@progbits .p2align 4,,15 .globl main .type main, @function main: .LFB25: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movl $1073741824, %edi pushq %rbx .cfi_def_cfa_offset 24 .cfi_offset 3, -24 subq $8, %rsp .cfi_def_cfa_offset 32 call malloc movl $1073741824, %edi movq %rax, %rbx call malloc movl $1073741824, %edx movq %rax, %rbp movl $10, %esi movq %rbx, %rdi call memset movl $1073741824, %edx movl $15, %esi movq %rbp, %rdi call memset movl $1073741824, %edx movq %rbx, %rsi movq %rbp, %rdi call memcpy movq %rbp, %rdi call free movq %rbx, %rdi call free addq $8, %rsp .cfi_def_cfa_offset 24 xorl %eax, %eax popq %rbx .cfi_def_cfa_offset 16 popq %rbp .cfi_def_cfa_offset 8 ret .cfi_endproc .LFE25: .size main, .-main .ident "GCC: (GNU) 4.6.1" .section .note.GNU-stack,"",@progbits

进展！！！！ ASMLib程序

根据@tbenson的build议，我试着用memcpy的asmlib版本运行。我的结果最初很差，但是在将SetMemcpyCacheLimit（）更改为1GB（缓冲区的大小）之后，我的速度与我天真的循环一致！

坏消息是，memmove的asmlib版本比glibc版本慢，现在运行在300毫秒（与memcpy的glibc版本相当）。奇怪的是，在笔记本电脑上，当我SetMemcpyCacheLimit（）大量伤害性能… … –

在下面的结果中，用SetCache标记的行将SetMemcpyCacheLimit设置为1073741824.没有SetCache的结果不调用SetMemcpyCacheLimit（）

结果使用asmlib中的函数：

 Buffer Size: 1GB | memcpy (ms) | memmove(ms) | naiveMemcpy() ------------------------------------------------------------ Laptop | 136 | 132 | 161 Laptop SetCache | 182 | 137 | 161 Server 1 | 305 | 302 | 164 Server 1 SetCache | 162 | 303 | 164 Server 2 | 300 | 299 | 166 Server 2 SetCache | 166 | 301 | 166

开始倾向于caching问题，但是会导致什么呢？

[我会做这个评论，但是没有足够的声望去做。]

我有一个类似的系统，看到类似的结果，但可以添加几个数据点：

如果你改变了你朴素的memcpy的方向（即转换为*p_dest-- = *p_src-- ），那么你的性能可能比正向（ *p_dest-- = *p_src-- ）差得多。 glibc 2.12中的memcpy()发生了一些变化，这些变化暴露了几个在重叠缓冲区（ http://lwn.net/Articles/414467/ ）上调用memcpy bug，我相信这个问题是由于切换到一个memcpy版本引起的倒退。所以，向后拷贝或向前拷贝可以解释memcpy() / memmove()差异。
不使用非临时商店似乎更好。许多经过优化的memcpy()实现切换到大型缓冲区（即大于最后一级caching）的非临时存储区（未caching）。我testing了Agner Fog的memcpy版本（ http://www.agner.org/optimize/#asmlib ），发现它和glibc的版本大致相同。但是， asmlib具有一个函数（ SetMemcpyCacheLimit ），它允许设置超过其使用非临时存储的阈值。将此限制设置为8GiB（或者仅比1 GiB缓冲区大），以避免非临时存储在我的情况下（时间低至176ms）性能翻倍。当然，这只是与前瞻性的天真performance相匹配，所以并不是很出色。
这些系统上的BIOS允许启用/禁用四个不同的硬件预取程序（MLC Streamer预取程序，MLC空间预取程序，DCU Streamer预取程序和DCU IP预取程序）。我试图禁用每一个，但是这样做最多保持了性能平衡并降低了一些设置的性能。
禁用运行平均功率限制（RAPL）DRAM模式不会产生影响。
我可以访问运行Fedora 19的其他Supermicro系统（glibc 2.17）。使用Supermicro X9DRG-HF板，Fedora 19和Xeon E5-2670 CPU时，我看到与上述类似的性能。在运行至强E3-1275 v3（Haswell）和Fedora 19的Supermicro X10SLM-F单sockets板上， memcpy （104ms）的速度为9.6 GB / s。 Haswell系统上的RAM是DDR3-1600（与其他系统相同）。

更新

我将CPU电源pipe理设置为最高性能，并禁用BIOS中的超线程。基于/proc/cpuinfo ，核心然后在3GHz时钟。但是，这个奇怪的降低了10％左右的内存性能。
memtest86 + 4.10向主内存报告9091 MB / s的带宽。我找不到这是否对应于读取，写入或复制。
STREAM基准testing报告的拷贝数为13422 MB / s，但是它们都是以读取和写入的方式计算字节数，所以如果我们想要比较上面的结果，那么相当于〜6.5 GB / s。

这对我来说很正常。

使用两个CPUpipe理8x16GB ECC内存条比使用2x2GB的单个CPU要困难得多。你的16GB棒是双面内存+他们可能有缓冲区+ ECC（甚至在主板级别禁用）…所有这些都使得数据通路更长。你也有2个CPU共享内存，即使你在另一个CPU上什么都不做，内存总是很less。切换这些数据需要一些额外的时间。只要看看与显卡共享内存的电脑所带来的巨大性能。

还是你的服务器真的是强大的数据泵。我不确定在现实生活软件中经常出现重复1GB事件的情况，但是我确定你的128GB比任何硬盘都快，即使是最好的SSD也是如此，你可以在这里使用你的服务器。使用3GB进行相同的testing将使您的笔记本电脑着火。

这看起来就像是一个完美的例子，说明基于商品硬件的体系结构如何比大型服务器更有效率。在这些大型服务器上花费的钱可以消费多less台PC呢？

感谢您提出非常详细的问题。

编辑：（花了我很长时间来写这个答案，我错过了图表的一部分。）

我认为这个问题是关于数据存储的地方。你可以请比较一下：

testing一：分配两个连续的500Mb ram块并从一个副本复制到另一个（你已经完成了）
testing二：分配20个（或更多）500Mb的内存块，从第一个到最后一个拷贝，因此它们彼此远离（即使你不能确定它们的真实位置）。

这样你会看到内存控制器如何处理彼此远离的内存块。我认为你的数据被放在不同的内存区域，并且它需要在数据path的某个点与另一个区域进行交换操作（在双面内存中存在这样的问题）。

另外，你是否确保线程绑定到一个CPU？

编辑2：

内存有几种“区域”分隔符。 NUMA是一个，但这不是唯一的一个。例如双面棍子需要一个标志来解决一方或另一方。看看你的图表，即使在笔记本电脑上（即没有NUMA），性能也会随着大块内存的下降而下降。我不知道这一点，但memcpy可能会使用硬件function复制内存（一种DMA），这个芯片必须比你的CPU有更less的caching，这可以解释为什么与CPU的愚笨的副本比memcpy更快。

基于IvyBridge的笔记本电脑的某些CPU性能可能会比基于SandyBridge的服务器有所提高。

页面交叉预取 – 当你到达当前页面的末尾时，你的笔记本电脑CPU将预先读取下一个线性页面，每次都会为你节省一笔令人讨厌的TLB。要尝试缓解这一点，请尝试构build2M / 1G页面的服务器代码。
高速cachingreplacescheme似乎也得到了改进（请参阅这里有趣的逆向工程）。如果这个CPU确实使用了一个dynamic插入策略，那么它将很容易地防止你的复制数据试图冲击你的最后一级caching（由于它的大小，它不能有效地使用），并且保留空间用于其他有用的caching如代码，堆栈，页表数据等）。为了testing这个，你可以尝试使用stream加载/存储（ movntdq或类似的，你也可以使用gcc内置）重build你的天真的实现。这种可能性可能解释了大数据集规模的突然下降。
我相信也使用了string复制（这里）进行了一些改进，这里可能会也可能不适用，这取决于您的汇编代码的外观。您可以尝试与Dhrystone进行基准testing，以testing是否存在固有差异。这也可以解释memcpy和memmove之间的区别。

如果你能得到一台基于IvyBridge的服务器或一台Sandy-Bridge笔记本电脑，那么一起testing所有这些将是最简单的。

我修改了基准，使用Linux中的nsec定时器，发现在不同的处理器上有类似的变化，所有处理器都有类似的内存。所有运行的RHEL 6.数字在多次运行中都是一致的。

 Sandy Bridge E5-2648L v2 @ 1.90GHz, HT enabled, L2/L3 256K/20M, 16 GB ECC malloc for 1073741824 took 47us memset for 1073741824 took 643841us memcpy for 1073741824 took 486591us Westmere E5645 @2.40 GHz, HT not enabled, dual 6-core, L2/L3 256K/12M, 12 GB ECC malloc for 1073741824 took 54us memset for 1073741824 took 789656us memcpy for 1073741824 took 339707us Jasper Forest C5549 @ 2.53GHz, HT enabled, dual quad-core, L2 256K/8M, 12 GB ECC malloc for 1073741824 took 126us memset for 1073741824 took 280107us memcpy for 1073741824 took 272370us

这里是内联C代码-O3的结果

 Sandy Bridge E5-2648L v2 @ 1.90GHz, HT enabled, 256K/20M, 16 GB malloc for 1 GB took 46 us memset for 1 GB took 478722 us memcpy for 1 GB took 262547 us Westmere E5645 @2.40 GHz, HT not enabled, dual 6-core, 256K/12M, 12 GB malloc for 1 GB took 53 us memset for 1 GB took 681733 us memcpy for 1 GB took 258147 us Jasper Forest C5549 @ 2.53GHz, HT enabled, dual quad-core, 256K/8M, 12 GB malloc for 1 GB took 67 us memset for 1 GB took 254544 us memcpy for 1 GB took 255658 us

对于这个问题，我也尝试使内联memcpy一次做8个字节。在这些英特尔处理器上并没有明显的差别。 caching将所有字节操作合并到最小数量的内存操作中。我怀疑海湾合作委员会的代码库试图太聪明。

这个问题已经在上面得到了解答，但是在任何情况下，这里都是一个使用AVX的实现，如果这是您所担心的，那么对于大型副本应该更快一些：

 #define ALIGN(ptr, align) (((ptr) + (align) - 1) & ~((align) - 1)) void *memcpy_avx(void *dest, const void *src, size_t n) { char * d = static_cast<char*>(dest); const char * s = static_cast<const char*>(src); /* fall back to memcpy() if misaligned */ if ((reinterpret_cast<uintptr_t>(d) & 31) != (reinterpret_cast<uintptr_t>(s) & 31)) return memcpy(d, s, n); if (reinterpret_cast<uintptr_t>(d) & 31) { uintptr_t header_bytes = 32 - (reinterpret_cast<uintptr_t>(d) & 31); assert(header_bytes < 32); memcpy(d, s, min(header_bytes, n)); d = reinterpret_cast<char *>(ALIGN(reinterpret_cast<uintptr_t>(d), 32)); s = reinterpret_cast<char *>(ALIGN(reinterpret_cast<uintptr_t>(s), 32)); n -= min(header_bytes, n); } for (; n >= 64; s += 64, d += 64, n -= 64) { __m256i *dest_cacheline = (__m256i *)d; __m256i *src_cacheline = (__m256i *)s; __m256i temp1 = _mm256_stream_load_si256(src_cacheline + 0); __m256i temp2 = _mm256_stream_load_si256(src_cacheline + 1); _mm256_stream_si256(dest_cacheline + 0, temp1); _mm256_stream_si256(dest_cacheline + 1, temp2); } if (n > 0) memcpy(d, s, n); return dest; }

这些数字对我来说是有意义的。其实有两个问题，我会回答他们两个。

首先，我们需要有一个心智模型，说明如何使用现代英特尔处理器来传输大量内存。这个描述是近似的 ，细节可能会有所不同，从架构到架构，但高层次的想法是相当稳定的。

当L1数据caching中的负载未命中时，会分配一个行缓冲区 ，它将跟踪未命中请求，直到它被填充。如果它在L2高速缓冲存储器中，这可能是短时间（十几个周期左右），或者如果它一直错过DRAM，则可能长得多（100+纳秒）。
每个内核 ^1的线路缓冲器数量有限，一旦它们满了，进一步的失误就会停顿等待一个。
除了用于请求^3个加载/存储的这些填充缓冲器之外，DRAM和L2之间的存储器移动以及预取所使用的较低级别的caching还有额外的缓冲器。
内存子系统本身具有最大带宽限制 ，您可以在ARK上方便地find它。例如，联想笔记本电脑中的3720QM显示25.6 GB的限制。该限制基本上是每个传输的有效频率（ 1600 Mhz ）乘以8个字节（64位）乘以通道数量（2）的乘积： 1600 * 8 * 2 = 25.6 GB/s 。手头的服务器芯片每个插槽的峰值带宽为51.2 GB / s ，总系统带宽为〜102 GB / s。

与其他处理器function不同的是，整个芯片通常只有一个可能的理论带宽数字，因为它仅取决于在许多不同芯片上，甚至在架构上经常是相同的值。由于各种低层次的担忧，这里讨论一下，理论上讲，DRAM是不切实际的，但是通常可以达到90％以上。

所以（1）的主要结果是你可以把错过的内存作为一种请求响应系统。 DRAM未命中分配一个填充缓冲区 ，当请求返回时缓冲区被释放。每个CPU只有10个缓冲区用于未命中需求，这将严格限制单个CPU可产生的需求内存带宽，这是其延迟的函数。

例如，可以说你的E5-2680有80ns的延迟。每个请求都会引入一个64字节的高速caching行，所以你只是向DRAM发出请求，你会期望64 bytes / 80 ns = 0.8 GB/s的吞吐量，而且你会把它再次减半）得到一个memcpy数字，因为它需要读取和写入。幸运的是，您可以使用10个行填充缓冲区，因此可以将10个并发请求重叠到内存中，并将带宽增加10倍，从而产生8 GB / s的理论带宽。

如果你想挖掘更多的细节，这个线程是非常纯粹的黄金。你会发现John McCalpin的事实和数据，也就是下面的Dr Bandwidth博士。

那么让我们进入细节并回答两个问题

为什么memcpy比memmove或者手动拷贝在服务器上慢得多？

您显示，笔记本电脑系统在大约120 ms内执行memcpy基准testing，而服务器部件大约需要300 ms 。你还表明，这种缓慢大多不是基本的，因为你可以使用memmove和你的手动memcpy（以下简称hrm ）达到约160ms的时间，离笔记本电脑性能更近（但仍然比较慢）。

我们已经在上面展示了单核的带宽受限于可用的并发和延迟，而不是DRAM的带宽。我们预计服务器部件可能会有更长的延迟，但不会300 / 120 = 2.5x ！

答案在于stream式（又称非时间性）商店 。您使用的memcpy的libc版本使用它们，但是memmove不会。你用你的“天真”的memcpy ，也没有使用它们，以及我的configurationasmlib都使用stream媒体存储（慢），而不是（快）。

stream媒体商店伤害了单个CPU号码，因为：

（A）它们防止预取将待存储的行引入到caching中，这允许更多的并发性，因为预取硬件具有需要加载/存储使用的10个填充缓冲区以外的其他专用缓冲区。
（B）据了解，E5-2680在stream媒体商店中速度特别慢。

这两个问题都可以通过上述链接线索中John McCalpin的引用来解释。关于预取有效性和stream式商店的话题，他说：

使用“普通”存储，L2硬件预取器可以提前获取线路，减less线路填充缓冲器占用的时间，从而增加持续的带宽。另一方面，在stream式（高速caching旁路）存储器中，存储器的行填充缓冲器条目被占据全部时间以将数据传递到DRAM控制器。在这种情况下，负载可以通过硬件预取来加速，但是商店不能，所以你得到了一些加速，但是如果加载和存储都加速，你就不会得到那么多。

…然后对于E5上stream式存储的显然更长的延迟，他说：

至强E3的更简单的“非核心”可能导致stream式存储的行填充缓冲区占用率大大降低。 Xeon E5具有更复杂的环形结构来导航，以便将stream存储器从核心缓冲区切换到内存控制器，因此占用率可能会比内存（读取）延迟的差距更大。

尤其是，McCalpin博士测量E5相比，与“客户”uncore芯片减less约1.8倍的放缓，但业绩报告的2.5倍放缓是一致的，因为1.8倍的分数据报道为STREAM TRIAD 2：1的load：stores比率，而memcpy比例是1：1，商店是问题的部分。

这不会使stream式传输成为一件坏事 – 事实上，您正在为延迟更小的总带宽消耗而牺牲时间。由于使用单个内核时并发性受到限制，因此带宽更less，但避免了所有的读取所有权stream量，因此如果您在所有内核上同时运行testing，则可能会看到（小）收益。

到目前为止，从软件或硬件configuration文件来看，其他用户使用相同的CPU也报告了相同的减速。

为什么在使用普通商店时服务器部分仍然较慢？

即使在纠正了非暂时性商店问题之后，服务器部分仍然会出现大约160 / 120 = ~1.33x放缓。是什么赋予了？

Well it's a common fallacy that server CPUs are faster in all respects faster or at least equal to their client counterparts. It's just not true – what you are paying for (often at $2,000 a chip or so) on the server parts is mostly (a) more cores (b) more memory channels (c) support for more total RAM (d) support for "enterprise-ish" features like ECC, virutalization features, etc ⁵ .

In fact, latency-wise, server parts are usually only equal or slower to their client ⁴ parts. When it comes to memory latency, this is especially true, because:

The server parts have a more scalable, but complex "uncore" that often needs to support many more cores and consequently the path to RAM is longer.
The server parts support more RAM (100s of GB or a few TB) which often requires electrical buffers to support such a large quantity.
As in the OP's case server parts are usually multi-socket, which adds cross-socket coherence concerns to the memory path.

So it is typical that server parts have a latency 40% to 60% longer than client parts. For the E5 you'll probably find that ~80 ns is a typical latency to RAM, while client parts are closer to 50 ns.

So anything that is RAM latency constrained will run slower on server parts, and as it turns out, memcpy on a single core is latency constrained. that's confusing because memcpy seems like a bandwidth measurement, right? Well as described above, a single core doesn't have enough resources to keep enough requests to RAM in flight at a time to get close to the RAM bandwidth ⁶ , so performance depends directly on latency.

The client chips, on the other hand, have both lower latency and lower bandwidth, so one core comes much closer to saturating the bandwidth (this is often why streaming stores are a big win on client parts – when even a single core can approach the RAM bandwidth, the 50% store bandwidth reduction that stream stores offers helps a lot.

参考

There are lots of good sources to read more on this stuff, here are a couple.

A detailed description of memory latency components
Lots of memory latency results across CPUs new and old (see the MemLatX86 and NewMemLat ) links
Detailed analysis of Sandy Bridge (and Opteron) memory latencies – almost the same chip the OP is using.

¹ By large I just mean somewhat larger than the LLC. For copies that fit in the LLC (or any higher cache level) the behavior is very different. The OPs llcachebench graph shows that in fact the performance deviation only starts when the buffers start to exceed the LLC size.

² In particular, the number of line fill buffers has apparently been constant at 10 for several generations, including the architectures mentioned in this question.

³ When we say demand here, we mean that it is associated with an explicit load/store in the code, rather than say being brought in by a prefetch.

⁴ When I refer to a server part here, I mean a CPU with a server uncore . This largely means the E5 series, as the E3 series generally uses the client uncore .

⁵ In the future, it looks like you can add "instruction set extensions" to this list, as it seems that AVX-512 will appear only on the Skylake server parts.

⁶ Per little's law at a latency of 80 ns, we'd need (51.2 B/ns * 80 ns) == 4096 bytes or 64 cache lines in flight at all times to reach the maximum bandwidth, but one core provides less than 20.

Server 1 Specs

CPU: 2x Intel Xeon E5-2680 @ 2.70 Ghz

Server 2 Specs

CPU: 2x Intel Xeon E5-2650 v2 @ 2.6 Ghz

According to Intel ARK, both the E5-2650 and E5-2680 have AVX extension.

CMake File to Build

This is part of your problem. CMake chooses some rather poor flags for you. You can confirm it by running make VERBOSE=1 .

You should add both -march=native and -O3 to your CFLAGS and CXXFLAGS . You will likely see a dramatic performance increase. It should engage the AVX extensions. Without -march=XXX , you effectively get a minimal i686 or x86_64 machine. Without -O3 , you don't engage GCC's vectorizations.

I'm not sure if GCC 4.6 is capable of AVX (and friends, like BMI). I know GCC 4.8 or 4.9 is capable because I had to hunt down an alignment bug that was causing a segfault when GCC was outsourcing memcpy's and memset's to the MMX unit. AVX and AVX2 allow the CPU to operate on 16-byte and 32-byte blocks of data at a time.

If GCC is missing an opportunity to send aligned data to the MMX unit, it may be missing the fact that data is aligned. If your data is 16-byte aligned, then you might try telling GCC so it knows to operate on fat blocks. For that, see GCC's __builtin_assume_aligned . Also see questions like How to tell GCC that a pointer argument is always double-word-aligned?

This also looks a little suspect because of the void* . Its kind of throwing away information about the pointer. You should probably keep the information:

 void doMemmove(void* pDest, const void* pSource, std::size_t sizeBytes) { memmove(pDest, pSource, sizeBytes); }

Maybe something like the following:

 template <typename T> void doMemmove(T* pDest, const T* pSource, std::size_t count) { memmove(pDest, pSource, count*sizeof(T)); }

Another suggestion is to use new , and stop using malloc . Its a C++ program and GCC can make some assumptions about new that it cannot make about malloc . I believe some of the assumptions are detailed in GCC's option page for the built-ins.

Yet another suggestion is to use the heap. Its always 16-byte aligned on typical modern systems. GCC should recognize it can offload to the MMX unit when a pointer from the heap is involved (sans the potential void* and malloc issues).

Finally, for a while, Clang was not using the native CPU extensions when using -march=native . See, for example, Ubuntu Issue 1616723, Clang 3.4 only advertises SSE2 , Ubuntu Issue 1616723, Clang 3.5 only advertises SSE2 , and Ubuntu Issue 1616723, Clang 3.6 only advertises SSE2 .