Enhanced REP MOVSB for memcpy

I would like to use Enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy.

ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and STOSB operation (ERMSB)" in the Intel optimization manual if you don't know what ERMSB is.

The only way I know to do this directly is with inline assembly. I got the following function from https://groups.google.com/forum/#!topic/gnu.gcc.help/-Bmlm_EG_fE

    static inline void *__movsb(void *d, const void *s, size_t n) {
      asm volatile ("rep movsb"
                    : "=D" (d), "=S" (s), "=c" (n)
                    : "0" (d), "1" (s), "2" (n)
                    : "memory");
      return d;
    }

When I use this, however, the bandwidth is much lower than with memcpy. __movsb gets 15 GB/s and memcpy gets 26 GB/s on my i7-6700HQ (Skylake) system, Ubuntu 16.10, DDR4 @ 2400 MHz dual channel 32 GB, GCC 6.2.

Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?

Here is the code I used to test this.

    //gcc -O3 -march=native -fopenmp foo.c
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>
    #include <stddef.h>
    #include <omp.h>
    #include <x86intrin.h>

    static inline void *__movsb(void *d, const void *s, size_t n) {
      asm volatile ("rep movsb"
                    : "=D" (d), "=S" (s), "=c" (n)
                    : "0" (d), "1" (s), "2" (n)
                    : "memory");
      return d;
    }

    int main(void) {
      int n = 1<<30;

      //char *a = malloc(n), *b = malloc(n);

      char *a = _mm_malloc(n,4096), *b = _mm_malloc(n,4096);
      memset(a,2,n), memset(b,1,n);

      __movsb(b,a,n);
      printf("%d\n", memcmp(b,a,n));

      double dtime;

      dtime = -omp_get_wtime();
      for(int i=0; i<10; i++) __movsb(b,a,n);
      dtime += omp_get_wtime();
      printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);

      dtime = -omp_get_wtime();
      for(int i=0; i<10; i++) memcpy(b,a,n);
      dtime += omp_get_wtime();
      printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);
    }

The reason I am interested in rep movsb is based on these comments:

Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs an RFO into LLC, rep movsb does not… rep movsb is significantly faster than movntdqa when streaming to memory on Ivybridge and Haswell (but be aware that pre-Ivybridge it is slow!)

What is missing/sub-optimal in this memcpy implementation?


Here are my results from tinymembench on the same system.

  C copy backwards                                     :   7910.6 MB/s (1.4%)
  C copy backwards (32 byte blocks)                    :   7696.6 MB/s (0.9%)
  C copy backwards (64 byte blocks)                    :   7679.5 MB/s (0.7%)
  C copy                                               :   8811.0 MB/s (1.2%)
  C copy prefetched (32 bytes step)                    :   9328.4 MB/s (0.5%)
  C copy prefetched (64 bytes step)                    :   9355.1 MB/s (0.6%)
  C 2-pass copy                                        :   6474.3 MB/s (1.3%)
  C 2-pass copy prefetched (32 bytes step)             :   7072.9 MB/s (1.2%)
  C 2-pass copy prefetched (64 bytes step)             :   7065.2 MB/s (0.8%)
  C fill                                               :  14426.0 MB/s (1.5%)
  C fill (shuffle within 16 byte blocks)               :  14198.0 MB/s (1.1%)
  C fill (shuffle within 32 byte blocks)               :  14422.0 MB/s (1.7%)
  C fill (shuffle within 64 byte blocks)               :  14178.3 MB/s (1.0%)
  ---
  standard memcpy                                      :  12784.4 MB/s (1.9%)
  standard memset                                      :  30630.3 MB/s (1.1%)
  ---
  MOVSB copy                                           :   8712.0 MB/s (2.0%)
  MOVSD copy                                           :   8712.7 MB/s (1.9%)
  SSE2 copy                                            :   8952.2 MB/s (0.7%)
  SSE2 nontemporal copy                                :  12538.2 MB/s (0.8%)
  SSE2 copy prefetched (32 bytes step)                 :   9553.6 MB/s (0.8%)
  SSE2 copy prefetched (64 bytes step)                 :   9458.5 MB/s (0.5%)
  SSE2 nontemporal copy prefetched (32 bytes step)     :  13103.2 MB/s (0.7%)
  SSE2 nontemporal copy prefetched (64 bytes step)     :  13179.1 MB/s (0.9%)
  SSE2 2-pass copy                                     :   7250.6 MB/s (0.7%)
  SSE2 2-pass copy prefetched (32 bytes step)          :   7437.8 MB/s (0.6%)
  SSE2 2-pass copy prefetched (64 bytes step)          :   7498.2 MB/s (0.9%)
  SSE2 2-pass nontemporal copy                         :   3776.6 MB/s (1.4%)
  SSE2 fill                                            :  14701.3 MB/s (1.6%)
  SSE2 nontemporal fill                                :  34188.3 MB/s (0.8%)

Note that on my system SSE2 copy prefetched is also faster than MOVSB copy.


In my original tests I did not disable turbo. I disabled turbo and tested again, and it does not appear to make much of a difference. However, changing the power management does make a big difference.

When I do

 sudo cpufreq-set -r -g performance 

I sometimes see over 20 GB/s with rep movsb. With

 sudo cpufreq-set -r -g powersave 

the best I see is about 17 GB/s. But memcpy does not seem to be sensitive to the power management.


I checked the frequency (using turbostat), with and without SpeedStep enabled, with performance and with powersave, for idle, a 1-core load and a 4-core load. I ran Intel's MKL dense matrix multiplication to create a load and set the number of threads using OMP_SET_NUM_THREADS. Here is a table of the results (numbers in GHz).

               SpeedStep    idle    1 core    4 core
  powersave    OFF          0.8     2.6       2.6
  performance  OFF          2.6     2.6       2.6
  powersave    ON           0.8     3.5       3.1
  performance  ON           3.5     3.5       3.1

This shows that with powersave, even with SpeedStep disabled, the CPU still clocks down to the idle frequency of 0.8 GHz. It's only with performance and SpeedStep disabled that the CPU runs at a constant frequency.

I used e.g. sudo cpufreq-set -r performance (because cpufreq-set gave strange results) to change the power settings. This turns turbo back on, so I had to disable turbo again afterwards.

This is a topic pretty near to my heart and recent investigations, so I'll look at it from a few angles: history, some technical notes (mostly academic), test results on my box, and finally an attempt to answer your actual question of when and where rep movsb might make sense.

Partly, this is a call to share results – if you could run Tinymembench and share the results along with details of your CPU and RAM configuration, it would be great. Especially if you have a 4-channel setup, an Ivy Bridge box, a server box, etc.

History and Official Advice

The performance history of the fast string copy instructions has been a bit of a stair-step affair – i.e., periods of stagnant performance alternating with big upgrades that brought them into line with, or even ahead of, competing approaches. For example, there was a jump in performance in Nehalem (mostly targeting startup overheads) and again in Ivy Bridge (mostly targeting total throughput for large copies). In this thread you can find decade-old insight into the difficulties of implementing the rep movs instructions, from an Intel engineer.

For example, in guides preceding the introduction of Ivy Bridge, the typical advice was to avoid them or to use them very carefully1.

The current (well, June 2016) guide has a variety of confusing and somewhat inconsistent advice, such as2:

The specific variant of the implementation is chosen at execution time based on data layout, alignment and the counter (ECX) value. For example, MOVSB/STOSB with the REP prefix should be used with a counter value less than or equal to three for best performance.

So for copies of 3 or fewer bytes? You don't need a rep prefix for that in the first place, since with a claimed startup latency of ~9 cycles you are almost certainly better off with a simple DWORD or QWORD mov with a bit of bit-twiddling to mask off the unused bytes (or perhaps with 2 explicit byte and word movs if you know the size is exactly three).
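For example, a copy known to be exactly 3 bytes can be done with one word move and one byte move. A minimal sketch (purely illustrative, not from the guide):

    /* Copy exactly 3 bytes with two scalar moves instead of rep movsb.
       The tiny fixed-size memcpy calls compile down to plain mov instructions. */
    #include <stdint.h>
    #include <string.h>

    static inline void copy3(void *dst, const void *src)
    {
        uint16_t lo;
        uint8_t  hi;
        memcpy(&lo, src, sizeof lo);                /* word load  */
        memcpy(&hi, (const uint8_t *)src + 2, 1);   /* byte load  */
        memcpy(dst, &lo, sizeof lo);                /* word store */
        memcpy((uint8_t *)dst + 2, &hi, 1);         /* byte store */
    }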

They go on to say:

String MOVE/STORE instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable. This means better efficiency can be achieved by decomposing an arbitrary counter value into a number of doublewords plus single byte moves with a count value less than or equal to 3.

This certainly seems wrong on current hardware with ERMSB, where rep movsb is at least as fast as, or faster than, the movd or movq variants for large copies.

In general, section 3.7.5 of the current guide contains a mix of reasonable and badly obsolete advice. This is common throughout the Intel manuals, since they are updated in an incremental fashion for each architecture (and purport to cover nearly two decades worth of architectures even in the current manual), and old sections are often not updated to replace or make conditional advice that doesn't apply to the current architecture.

They then go on to cover ERMSB in detail in section 3.7.6.

I won't go over the remaining advice exhaustively, but I'll summarize the good parts in the "why use it" section below.

Another important claim in the guide is that on Haswell, rep movsb has been enhanced to use 256-bit operations internally.

Technical Considerations

This is just a quick summary of the underlying advantages and disadvantages that the rep instructions have from an implementation standpoint.

Advantages of rep movs

  1. When a rep movs instruction is issued, the CPU knows that an entire block of a known size will be transferred. This can help it optimize the operation in ways it cannot with discrete instructions, for example:

    • Avoiding the RFO request when it knows the entire cache line will be overwritten.
    • Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
  2. Apparently, there is no guarantee of ordering among the stores within3 a single rep movs, which can help simplify coherency traffic and other aspects of the block move, versus simple mov instructions which have to obey rather strict memory ordering4.

  3. In principle, the rep movs instruction could take advantage of various architectural tricks that aren't exposed in the ISA. For example, the architecture may have wider internal data paths that the ISA doesn't expose5, and rep movs could use them internally.

Disadvantages

  1. rep movsb must implement a specific semantic which may be stronger than the underlying software requirement. In particular, memcpy forbids overlapping regions, and so may ignore that possibility, but rep movsb allows them and must produce the expected result. On current implementations this mostly affects the startup overhead, but probably not large-block throughput. Similarly, rep movsb must support byte-granular copies even if you are actually using it to copy large blocks which are a multiple of some large power of 2.

  2. The software may have information about alignment, copy size and possible aliasing that cannot be communicated to the hardware if rep movsb is used. Compilers can often determine the alignment of memory blocks6 and so can avoid much of the startup work that rep movs must do on every invocation.

Test Results

Here are test results for many different copy methods from tinymembench on my i7-6700HQ at 2.6 GHz (too bad I have the identical CPU, so we aren't getting a new data point...):

  C copy backwards                                     :   8284.8 MB/s (0.3%)
  C copy backwards (32 byte blocks)                    :   8273.9 MB/s (0.4%)
  C copy backwards (64 byte blocks)                    :   8321.9 MB/s (0.8%)
  C copy                                               :   8863.1 MB/s (0.3%)
  C copy prefetched (32 bytes step)                    :   8900.8 MB/s (0.3%)
  C copy prefetched (64 bytes step)                    :   8817.5 MB/s (0.5%)
  C 2-pass copy                                        :   6492.3 MB/s (0.3%)
  C 2-pass copy prefetched (32 bytes step)             :   6516.0 MB/s (2.4%)
  C 2-pass copy prefetched (64 bytes step)             :   6520.5 MB/s (1.2%)
  ---
  standard memcpy                                      :  12169.8 MB/s (3.4%)
  standard memset                                      :  23479.9 MB/s (4.2%)
  ---
  MOVSB copy                                           :  10197.7 MB/s (1.6%)
  MOVSD copy                                           :  10177.6 MB/s (1.6%)
  SSE2 copy                                            :   8973.3 MB/s (2.5%)
  SSE2 nontemporal copy                                :  12924.0 MB/s (1.7%)
  SSE2 copy prefetched (32 bytes step)                 :   9014.2 MB/s (2.7%)
  SSE2 copy prefetched (64 bytes step)                 :   8964.5 MB/s (2.3%)
  SSE2 nontemporal copy prefetched (32 bytes step)     :  11777.2 MB/s (5.6%)
  SSE2 nontemporal copy prefetched (64 bytes step)     :  11826.8 MB/s (3.2%)
  SSE2 2-pass copy                                     :   7529.5 MB/s (1.8%)
  SSE2 2-pass copy prefetched (32 bytes step)          :   7122.5 MB/s (1.0%)
  SSE2 2-pass copy prefetched (64 bytes step)          :   7214.9 MB/s (1.4%)
  SSE2 2-pass nontemporal copy                         :   4987.0 MB/s

Some key takeaways:

  • The rep movs methods are faster than all the other methods which aren't "non-temporal"7, and considerably faster than the "C" approaches which copy 8 bytes at a time.
  • The "non-temporal" methods are faster, by up to about 26%, than the rep movs ones – but that's a much smaller delta than the one you reported (26 GB/s vs 15 GB/s = ~73%).
  • If you are not using non-temporal stores, using 8-byte copies from C is pretty much just as good as 128-bit-wide SSE loads/stores. That's because a good copy loop can generate enough memory pressure to saturate the bandwidth (e.g., 2.6 GHz * 1 store/cycle * 8 bytes = 26 GB/s for stores).
  • There are no explicit 256-bit algorithms in tinymembench (except probably the "standard" memcpy), but it probably doesn't matter due to the note above.
  • The increased throughput of the non-temporal store approaches over the temporal ones is about 1.45x, which is very close to the 1.5x you would expect if NT eliminates 1 of the 3 transfers (i.e., 1 read, 1 write for NT vs 2 reads, 1 write). The rep movs approaches lie in the middle.
  • The combination of fairly low memory latency and modest 2-channel bandwidth means this particular chip happens to be able to saturate its memory bandwidth from a single thread, which changes the behavior dramatically.
  • rep movsd seems to use the same magic as rep movsb on this chip. That's interesting because ERMSB only explicitly targets movsb, and earlier tests on earlier archs with ERMSB showed movsb performing much faster than movsd. This is mostly academic since movsb is more general than movsd anyway.

Haswell

Looking at the Haswell results kindly provided by iwillnotexist in the comments, we see the same general trends (most relevant results extracted):

  C copy                                               :   6777.8 MB/s (0.4%)
  standard memcpy                                      :  10487.3 MB/s (0.5%)
  MOVSB copy                                           :   9393.9 MB/s (0.2%)
  MOVSD copy                                           :   9155.0 MB/s (1.6%)
  SSE2 copy                                            :   6780.5 MB/s (0.4%)
  SSE2 nontemporal copy                                :  10688.2 MB/s (0.3%)

The rep movsb approach is still slower than the non-temporal memcpy, but only by about 14% here (compared to ~26% in the Skylake test). The advantage of the NT techniques over their temporal cousins is now ~57%, even a bit more than the theoretical benefit of the bandwidth reduction.

When should you use rep movs?

Finally, a stab at your actual question: when or why should you use it? It draws on the points above and introduces a few new ideas. Unfortunately there is no simple answer: you'll have to weigh various factors, including some which you probably can't even know exactly, such as future developments.

Note that the alternative to rep movsb may be the optimized libc memcpy (including copies inlined by the compiler), or it may be a hand-rolled memcpy version. Some of the benefits below apply only in comparison to one or the other of these alternatives (e.g., "simplicity" helps against a hand-rolled version, but not against a built-in memcpy), but some apply to both.

Limited available instructions

In some environments there is a restriction on certain instructions or on using certain registers. For example, in the Linux kernel, use of SSE/AVX or FP registers is generally disallowed. Therefore most of the optimized memcpy variants cannot be used, as they rely on SSE or AVX registers, and a plain 64-bit mov-based copy is used on x86. For these platforms, using rep movsb allows most of the performance of an optimized memcpy without breaking the restriction on SIMD code.

A more general example might be code that has to target many generations of hardware and which doesn't use hardware-specific dispatching (e.g., using cpuid). Here you might be forced to use only older instruction sets, which rules out any AVX, etc. rep movsb might be a good approach here since it allows "hidden" access to wider loads and stores without using any new instructions. If you target pre-ERMSB hardware, though, you'd have to see whether rep movsb performance is acceptable there.

Future-proofing

A nice aspect of rep movsb is that it can, in theory, take advantage of architectural improvements on future architectures, without source changes, in a way that explicit moves cannot. For example, when 256-bit data paths were introduced, rep movsb was able to take advantage of them (as claimed by Intel) without any changes to the software. Software using 128-bit moves (which was optimal prior to Haswell) would have to be modified and recompiled.

So it is both a software maintenance benefit (no need to change the source) and a benefit for existing binaries (no need to deploy new binaries to take advantage of the improvement).

How important this is depends on your maintenance model (e.g., how often new binaries are deployed in practice), and it is very hard to judge how much faster these instructions are likely to become in the future. At least Intel is kind of guiding uses in this direction, by committing to at least reasonable performance in the future (15.3.3.6):

REP MOVSB and REP STOSB will continue to perform reasonably well on future processors.

Overlapping with subsequent work

This benefit won't show up in a plain memcpy benchmark, of course, which by definition has no subsequent work to overlap with, so the magnitude of the benefit would have to be carefully measured in a real-world scenario. Taking maximum advantage might require reorganizing the code surrounding the memcpy.

This benefit is pointed out by Intel in their optimization manual (section 11.16.3.4) and in their words:

When the count is known to be at least a thousand bytes or more, using enhanced REP MOVSB/STOSB can provide another advantage to amortize the cost of the non-consuming code. The heuristic can be understood using a value of Cnt = 4096 and memset() as example:

• A 256-bit SIMD implementation of memset() will need to issue/execute/retire 128 instances of 32-byte store operations with VMOVDQA, before the non-consuming instruction sequences can make their way to retirement.

• An instance of enhanced REP STOSB with ECX = 4096 is decoded as a long micro-op flow provided by hardware, but retires as one instruction. There are many store_data operations that must complete before the result of memset() can be consumed. Because the completion of the store data operations is decoupled from program-order retirement, a substantial part of the non-consuming code stream can process through issue/execute and retirement, essentially cost-free if the non-consuming sequence does not compete for store buffer resources.

So Intel is saying that some code after rep movsb has issued after all, but while lots of stores are still in flight and the rep movsb as a whole hasn't retired yet, uops from following instructions can make more progress through the out-of-order machinery than they could if that code came after a copy loop.

The uops from an explicit load-and-store loop all have to actually retire separately, in program order. That has to happen to make room in the ROB for following uops.

There doesn't seem to be much detailed information about how, exactly, very long microcoded instructions like rep movsb work. We don't know exactly how microcode branches request a different stream of uops from the microcode sequencer, or how the uops retire. If the individual uops don't have to retire separately, perhaps the whole instruction only takes up one slot in the ROB?

When the front end that feeds the OoO machinery sees a rep movsb instruction in the uop cache, it activates the Microcode Sequencer ROM (MS-ROM) to send microcode uops into the queue that feeds the issue/rename stage. It's probably not possible for any other uops to mix in with that and issue/execute8 while rep movsb is still issuing, but subsequent instructions can be fetched/decoded and can issue right after the last rep movsb uop does, while some of the copy hasn't been executed yet. This is only useful if at least some of your subsequent code doesn't depend on the result of the memcpy (which isn't unusual).

Now, the size of this benefit is limited: at most you can execute N instructions (uops actually) beyond the slow rep movsb instruction, at which point you'll stall, where N is the ROB size. With current ROB sizes of ~200 (192 on Haswell, 224 on Skylake), that's a maximum benefit of ~200 cycles of free work for subsequent code with an IPC of 1. In 200 cycles you can copy somewhere around 800 bytes at 10 GB/s, so for copies of that size you may get free work close to the cost of the copy (in a way making the copy free).

As the copy size gets much larger, however, the relative importance of this diminishes rapidly (e.g., if you are copying 80 KB instead, the free work is only 1% of the copy cost). Still, it is quite interesting for modestly-sized copies.

Copy loops don't totally block subsequent instructions from executing, either. Intel does not go into detail on the size of the benefit, or on what kind of copies or surrounding code benefit most (hot or cold destination or source, high-ILP or low-ILP high-latency code afterwards).

Code size

The executed code size (a few bytes) is microscopic compared to a typical optimized memcpy routine. If performance is at all limited by i-cache (including uop cache) misses, the reduced code size might be a benefit.

Again, we can bound the magnitude of this benefit based on the size of the copy. I won't actually work it out numerically, but the intuition is that reducing the dynamic code size by B bytes can save at most C * B cache misses, for some constant C. Every call to memcpy incurs the cache-miss cost (or benefit) once, but the advantage of higher throughput scales with the number of bytes copied. So for large transfers, higher throughput will dominate the cache effects.

Again, this is not something that will show up in a plain benchmark, where the entire loop will no doubt fit in the uop cache. You'll need a real-world, in-place test to evaluate this effect.

Architecture specific optimization

You reported that on your hardware, rep movsb was considerably slower than the platform memcpy. However, even here there are reports of the opposite result on earlier hardware (like Ivy Bridge).

That's entirely plausible, since it seems that the string move operations get love periodically – but not every generation, so they may well be faster, or at least tied, on architectures where they have been brought up to date, only to fall behind on later hardware.

Quoting Andy Glew, who should know a thing or two about this after implementing these on the P6:

The big weakness of doing fast strings in microcode was that the microcode fell out of tune with every generation, getting slower and slower until somebody got around to fixing it. Just like a library memcpy falls out of tune. I suppose it is possible that one of the missed opportunities was to use 128-bit loads and stores when they became available, and so on.

In that case, it can be seen as just another "platform specific" optimization to apply in the typical every-trick-in-the-book memcpy routines you find in standard libraries and JIT compilers: but only for use on architectures where it is better. For JIT or AOT compiled stuff this is easy, but for statically compiled binaries it does require platform-specific dispatch, though that often already exists (sometimes implemented at link time), or the mtune argument can be used to make a static decision.

Simplicity

Even on Skylake, where it seems to have fallen behind the absolute fastest non-temporal techniques, it is still faster than most approaches and is very simple. This means less validation time, fewer mystery bugs, less time tuning and updating a monster memcpy implementation (or, conversely, less dependence on the whims of the standard library implementors if you rely on that).

Latency bound platforms

Memory-throughput-bound algorithms9 can actually operate in two main overall regimes: DRAM bandwidth bound or concurrency/latency bound.

The first mode is the one you are probably familiar with: the DRAM subsystem has a certain theoretical bandwidth that you can calculate pretty easily based on the number of channels, data rate/width and frequency. For example, my dual-channel DDR4-2133 system has a maximum bandwidth of 2.133 * 8 * 2 = 34.1 GB/s, the same as reported on ARK.

You won't sustain more than that rate from DRAM (and usually somewhat less due to various inefficiencies), summed across all the cores on the socket (i.e., it is a global limit for single-socket systems).

The other limit is imposed by how many concurrent requests a core can actually issue to the memory subsystem. Imagine if a core could only have 1 request in progress at once, for a 64-byte cache line – when the request completed, you could issue another. Assume also a very fast 50 ns memory latency. Then despite the large 34.1 GB/s DRAM bandwidth, you'd actually only get 64 bytes / 50 ns = 1.28 GB/s, or less than 4% of the maximum bandwidth.

In practice, cores can have more than one request outstanding at a time, but not an unlimited number. It is usually understood that there are only 10 line fill buffers per core between the L1 and the rest of the memory hierarchy, and perhaps 16 or so fill buffers between the L2 and DRAM. Prefetching competes for the same resources, but at least helps reduce the effective latency. For more details look at any of the great posts Dr. Bandwidth has written on the topic, mostly on the Intel forums.
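To make those two limits concrete, here is a small back-of-the-envelope calculation using the numbers above (the ~50 ns latency and 10 fill buffers are the illustrative figures from the text, not measurements):

    /* Back-of-the-envelope sketch of the two bandwidth limits discussed above. */
    #include <stdio.h>

    int main(void)
    {
        double peak_dram  = 2.133e9 * 8 * 2;   /* transfers/s * bytes/transfer * channels */
        double latency_s  = 50e-9;             /* assumed memory latency */
        double line_bytes = 64.0;              /* cache line size */
        int    fill_bufs  = 10;                /* assumed L1 line fill buffers */

        /* One request in flight at a time: line / latency. */
        printf("1 request in flight  : %.2f GB/s\n", line_bytes / latency_s / 1e9);
        /* Little's law with N concurrent misses: N * line / latency. */
        printf("%d requests in flight: %.2f GB/s\n",
               fill_bufs, fill_bufs * line_bytes / latency_s / 1e9);
        printf("peak DRAM bandwidth  : %.1f GB/s\n", peak_dram / 1e9);
        return 0;
    }

This prints 1.28 GB/s, 12.80 GB/s and 34.1 GB/s – roughly why the per-core figures mentioned next land in the 12 - 20 GB/s range.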

Still, most recent CPUs are limited by this factor, not by the RAM bandwidth. Typically they achieve 12 - 20 GB/s per core, while the RAM bandwidth may be 50+ GB/s (on a 4-channel system). Only some recent-gen 2-channel "client" cores, which seem to have a better uncore and perhaps more line buffers, can hit the DRAM limit on a single core, and our Skylake chips seem to be among them.

Now of course, there is a reason Intel designs systems with 50 GB/s of DRAM bandwidth while each core can only sustain less than 20 GB/s due to concurrency limits: the former limit is socket-wide and the latter is per core. So each core on an 8-core system can push 20 GB/s worth of requests, at which point they will be DRAM-limited again.

Why am I going on and on about this? Because the best memcpy implementation often depends on which regime you are operating in. Once you are DRAM BW limited (as our chips apparently are, but most aren't on a single core), using non-temporal writes becomes very important since it saves the read-for-ownership that normally wastes 1/3 of your bandwidth. You can see that exactly in the test results above: the memcpy implementations that don't use NT stores lose 1/3 of their bandwidth.
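For reference, the kind of non-temporal copy kernel those "SSE2 nontemporal copy" rows correspond to looks roughly like this (a minimal sketch assuming 16-byte-aligned pointers and a size that is a multiple of 16; not the actual tinymembench source):

    #include <emmintrin.h>
    #include <stddef.h>

    static void nt_copy(void *dst, const void *src, size_t n)
    {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < n / 16; i++) {
            __m128i v = _mm_load_si128(s + i);   /* regular (temporal) load        */
            _mm_stream_si128(d + i, v);          /* non-temporal store, avoids RFO */
        }
        _mm_sfence();   /* make the NT stores globally visible before returning */
    }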

If you are concurrency limited, however, the situation equalizes and sometimes reverses. You have DRAM bandwidth to spare, so NT stores don't help, and they can even hurt since they may increase latency: the handoff time for the line buffer may be longer than a scenario where prefetch brings the RFO line into LLC (or even L2) and then the store completes in LLC, for an effectively lower latency. Finally, server uncores tend to have much slower NT stores than client ones (and high bandwidth), which accentuates this effect.

So on other platforms you might find that NT stores are less useful (at least when you care about single-threaded performance) and perhaps rep movsb wins there (if it gets the best of both worlds).

Really, this last item is a call for more testing. I know that NT stores lose their apparent advantage for single-threaded tests on most archs (including current server archs), but I don't know how rep movsb will perform relatively...

References

Other good sources of info not integrated into the above.

A comp.arch investigation of rep movsb versus alternatives. Lots of good notes about branch prediction, and an implementation of the approach I've often suggested for small blocks: using overlapping first and/or last reads/writes rather than trying to write only exactly the required number of bytes (for example, implementing all copies from 9 to 16 bytes as two 8-byte copies which might overlap in up to 7 bytes).
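As a concrete illustration of that overlapping trick (my own sketch, not the comp.arch code): any copy of 9 to 16 bytes can be handled as two 8-byte copies whose ranges may overlap, with no byte-granular tail loop:

    #include <stdint.h>
    #include <string.h>

    /* Valid for 8 <= n <= 16; the two 8-byte copies may overlap in up to 7 bytes. */
    static void copy_9_to_16(void *dst, const void *src, size_t n)
    {
        uint64_t head, tail;
        memcpy(&head, src, 8);                          /* first 8 bytes              */
        memcpy(&tail, (const char *)src + n - 8, 8);    /* last 8 bytes (may overlap) */
        memcpy(dst, &head, 8);
        memcpy((char *)dst + n - 8, &tail, 8);
    }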


1 Presumably the intent is to restrict its use to cases where, for example, code size is very important.

2 See Section 3.7.5: REP Prefix and Data Movement.

3 The key thing to note is that this only applies to the various stores within the single instruction itself: once complete, the block of stores still appears ordered with respect to prior and subsequent stores. So code can see stores from the rep movs out of order with respect to each other, but not with respect to prior or subsequent stores (and it's the latter guarantee you usually need). It will only be a problem if you use the end of the copy destination as a synchronization flag, instead of a separate store.

4 Note that non-temporal discrete stores also avoid most of the ordering requirements, although in practice rep movs has even more freedom since there are still some ordering constraints on WC/NT stores.

5 This was common in the latter part of the 32-bit era, when many chips had 64-bit data paths (e.g., to support FPUs which had support for the 64-bit double type). Today, "neutered" chips such as the Pentium or Celeron brands have AVX disabled, but presumably the rep movs microcode can still use 256b loads/stores.

6 E.g., due to language alignment rules, alignment attributes or operators, aliasing rules or other information determined at compile time. In the case of alignment, even if the exact alignment cannot be determined, they may at least be able to hoist alignment checks out of loops or otherwise eliminate redundant checks.

7 I'm making the assumption that the "standard" memcpy is choosing a non-temporal approach, which is highly likely for this buffer size.

8 That isn't necessarily obvious, since it could be the case that the uop stream generated by rep movsb simply monopolizes dispatch and then it would look very much like the explicit mov case. It seems that it doesn't work like that, however – uops from subsequent instructions can mingle with uops from the microcoded rep movsb.

9 That is, those which can issue a large number of independent memory requests and hence saturate the available DRAM-to-core bandwidth, of which memcpy would be a poster child (as opposed to purely latency-bound loads such as pointer chasing).

You say that you want:

an answer that shows when ERMSB is useful

But I'm not sure it means what you think it means. Looking at the 3.7.6.1 docs you link to, it explicitly says:

implementing memcpy using ERMSB might not reach the same level of throughput as using 256-bit or 128-bit AVX alternatives, depending on length and alignment factors.

So just because CPUID indicates support for ERMSB, that isn't a guarantee that REP MOVSB will be the fastest way to copy memory. It just means it won't suck as bad as it has in some previous CPUs.

However just because there may be alternatives that can, under certain conditions, run faster doesn't mean that REP MOVSB is useless. Now that the performance penalties that this instruction used to incur are gone, it is potentially a useful instruction again.

Remember, it is a tiny bit of code (2 bytes!) compared to some of the more involved memcpy routines I have seen. Since loading and running big chunks of code also has a penalty (throwing some of your other code out of the cpu's cache), sometimes the 'benefit' of AVX et al is going to be offset by the impact it has on the rest of your code. Depends on what you are doing.

You also ask:

Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?

It isn't going to be possible to "do something" to make REP MOVSB run any faster. It does what it does.

If you want the higher speeds you are seeing from memcpy, you can dig up the source for it. It's out there somewhere. Or you can trace into it from a debugger and see the actual code paths being taken. My expectation is that it's using some of those AVX instructions to work with 128 or 256 bits at a time.

Or you can just… Well, you asked us not to say it.

This is not an answer to the stated question(s), only my results (and personal conclusions) when trying to find out.

In summary: GCC already optimizes memset() / memmove() / memcpy() (see eg gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). So, there is no reason to expect massive gains by using your own variant with GCC (unless you've forgotten important stuff like alignment attributes for your aligned data, or do not enable sufficiently specific optimizations like -O2 -march= -mtune= ). If you agree, then the answers to the stated question are more or less irrelevant in practice.

(I only wish there was a memrepeat() , the opposite of memcpy() compared to memmove() , that would repeat the initial part of a buffer to fill the entire buffer.)


I currently have an Ivy Bridge machine in use (Core i5-6200U laptop, Linux 4.4.0 x86-64 kernel, with erms in /proc/cpuinfo flags). Because I wanted to find out if I can find a case where a custom memcpy() variant based on rep movsb would outperform a straightforward memcpy() , I wrote an overly complicated benchmark.

The core idea is that the main program allocates three large memory areas: original , current , and correct , each exactly the same size, and at least page-aligned. The copy operations are grouped into sets, with each set having distinct properties, like all sources and targets being aligned (to some number of bytes), or all lengths being within the same range. Each set is described using an array of src , dst , n triplets, where all src to src+n-1 and dst to dst+n-1 are completely within the current area.

A Xorshift* PRNG is used to initialize original to random data. (Like I warned above, this is overly complicated, but I wanted to ensure I'm not leaving any easy shortcuts for the compiler.) The correct area is obtained by starting with original data in current , applying all the triplets in the current set, using memcpy() provided by the C library, and copying the current area to correct . This allows each benchmarked function to be verified to behave correctly.
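For reference, a xorshift64* step looks like this (the constants are the standard ones from the xorshift* literature; the benchmark's own source is not shown here):

    #include <stdint.h>

    /* One step of a xorshift64* PRNG; state must be initialized to a nonzero seed. */
    static uint64_t xorshift64star(uint64_t *state)
    {
        uint64_t x = *state;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        *state = x;
        return x * UINT64_C(0x2545F4914F6CDD1D);
    }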

Each set of copy operations is timed a large number of times using the same function, and the median of these is used for comparison. (In my opinion, median makes the most sense in benchmarking, and provides sensible semantics — the function is at least that fast at least half the time.)

To avoid compiler optimizations, I have the program load the functions and benchmarks dynamically, at run time. The functions all have the same form, void function(void *, const void *, size_t) — note that unlike memcpy() and memmove() , they return nothing. The benchmarks (named sets of copy operations) are generated dynamically by a function call (that takes the pointer to the current area and its size as parameters, among others).
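A minimal sketch of that dynamic-loading approach (the shared-object and symbol names here are hypothetical, only to show the mechanism; link with -ldl):

    #include <dlfcn.h>
    #include <stdio.h>
    #include <stddef.h>

    typedef void (*copy_fn)(void *, const void *, size_t);

    int main(void)
    {
        void *handle = dlopen("./copy_impls.so", RTLD_NOW);   /* hypothetical .so    */
        if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        copy_fn fn = (copy_fn)dlsym(handle, "rep_movsb");     /* hypothetical symbol */
        if (!fn) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        char src[64] = "hello", dst[64];
        fn(dst, src, sizeof src);   /* call the dynamically loaded copy function */
        puts(dst);

        dlclose(handle);
        return 0;
    }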

Unfortunately, I have not yet found any set where

    static void rep_movsb(void *dst, const void *src, size_t n)
    {
        __asm__ __volatile__ ( "rep movsb\n\t"
                               : "+D" (dst), "+S" (src), "+c" (n)
                               :
                               : "memory" );
    }

would beat

    static void normal_memcpy(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);
    }

using gcc -Wall -O2 -march=ivybridge -mtune=ivybridge with GCC 5.4.0 on the aforementioned Core i5-6200U laptop running a linux-4.4.0 64-bit kernel. Copying 4096-byte aligned and sized chunks comes close, however.

This means that at least thus far, I have not found a case where using a rep movsb memcpy variant would make sense. It does not mean there is no such case; I just haven't found one.

(At this point the code is a spaghetti mess I'm more ashamed than proud of, so I shall omit publishing the sources unless someone asks. The above description should be enough to write a better one, though.)


This does not surprise me much, though. The C compiler can infer a lot of information about the alignment of the operand pointers, and whether the number of bytes to copy is a compile-time constant, a multiple of a suitable power of two. This information can, and will/should, be used by the compiler to replace the C library memcpy() / memmove() functions with its own.

GCC does exactly this (see eg gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). Indeed, memcpy() / memset() / memmove() has already been separately optimized for quite a few x86 processor variants; it would quite surprise me if the GCC developers had not already included erms support.

GCC provides several function attributes that developers can use to ensure good generated code. For example, alloc_align (n) tells GCC that the function returns memory aligned to at least n bytes. An application or a library can choose which implementation of a function to use at run time, by creating a "resolver function" (that returns a function pointer), and defining the function using the ifunc (resolver) attribute.
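A minimal ifunc sketch (the function names and the AVX2 selection criterion here are hypothetical, only to show the mechanism):

    #include <stddef.h>
    #include <string.h>

    static void *copy_avx2(void *dst, const void *src, size_t n)
    {
        return memcpy(dst, src, n);   /* placeholder; a real one would use 256-bit ops */
    }

    static void *copy_fallback(void *dst, const void *src, size_t n)
    {
        return memcpy(dst, src, n);   /* placeholder generic implementation */
    }

    /* The resolver runs once, at load time, and returns the chosen implementation. */
    static void *(*resolve_copy(void))(void *, const void *, size_t)
    {
        __builtin_cpu_init();   /* required before __builtin_cpu_supports in a resolver */
        if (__builtin_cpu_supports("avx2"))
            return copy_avx2;
        return copy_fallback;
    }

    void *my_copy(void *, const void *, size_t)
        __attribute__((ifunc("resolve_copy")));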

One of the most common patterns I use in my code for this is

 some_type *pointer = __builtin_assume_aligned(ptr, alignment); 

where ptr is some pointer, alignment is the number of bytes it is aligned to; GCC then knows/assumes that pointer is aligned to alignment bytes.

Another useful built-in, albeit much harder to use correctly , is __builtin_prefetch() . To maximize overall bandwidth/efficiency, I have found that minimizing latencies in each sub-operation, yields the best results. (For copying scattered elements to consecutive temporary storage, this is difficult, as prefetching typically involves a full cache line; if too many elements are prefetched, most of the cache is wasted by storing unused items.)
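For illustration, a copy loop using __builtin_prefetch might look like the following (the 256-byte prefetch distance is a guess that would need per-machine tuning):

    #include <stddef.h>
    #include <string.h>

    static void copy_with_prefetch(void *dst, const void *src, size_t n)
    {
        const char *s = src;
        char *d = dst;
        size_t i;
        for (i = 0; i + 64 <= n; i += 64) {
            __builtin_prefetch(s + i + 256, 0, 0);  /* read, low temporal locality */
            memcpy(d + i, s + i, 64);               /* one cache line at a time    */
        }
        if (i < n)
            memcpy(d + i, s + i, n - i);            /* tail */
    }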

Enhanced REP MOVSB (Ivy Bridge and later)

Ivy Bridge microarchitecture (processors released in 2012 and 2013) introduced Enhanced REP MOVSB (we still need to check the corresponding bit) and allowed us to copy memory fast.
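Checking that bit means testing the ERMS flag, CPUID.(EAX=07H, ECX=0):EBX bit 9. A GCC/Clang sketch:

    #include <stdio.h>
    #include <cpuid.h>

    static int has_erms(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx) || eax < 7)
            return 0;                        /* leaf 7 not available */
        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        return (ebx >> 9) & 1;               /* ERMS feature flag */
    }

    int main(void)
    {
        printf("ERMS supported: %s\n", has_erms() ? "yes" : "no");
        return 0;
    }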

Cheapest versions of later processors – Kaby Lake Celeron and Pentium, released in 2017, don't have AVX that could have been used for fast memory copy, but still have the Enhanced REP MOVSB.

REP MOVSB (ERMSB) is only faster than AVX copy or general-use register copy if the block size is at least 256 bytes. For the blocks below 64 bytes, it is MUCH slower, because there is high internal startup in ERMSB – about 35 cycles.

See the Intel Manual on Optimization, section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB), in 248966_software_optimization_manual.pdf:

  • the startup cost is 35 cycles;
  • both the source and destination addresses have to be aligned to a 16-byte boundary;
  • the source region should not overlap with the destination region;
  • the length has to be a multiple of 64 to produce higher performance;
  • the direction has to be forward (CLD).

As I said earlier, REP MOVSB begins to outperform other methods when the length is at least 256 bytes, but to see a clear benefit over AVX copy, the length has to be more than 2048 bytes.

On the effect of alignment on REP MOVSB vs. AVX copy, the Intel Manual gives the following information:

  • if the source buffer is not aligned, the impact on ERMSB implementation versus 128-bit AVX is similar;
  • if the destination buffer is not aligned, the impact on ERMSB implementation can be 25% degradation, while 128-bit AVX implementation of memcpy may degrade only 5%, relative to 16-byte aligned scenario.

I have made tests on Intel Core i5-6600, under 64-bit, and I have compared REP MOVSB memcpy() with a simple MOV RAX, [SRC]; MOV [DST], RAX implementation when the data fits L1 cache :

REP MOVSB memcpy():

  - 1622400000 data blocks of 32 bytes took 17.9337 seconds to copy; 2760.8205 MB/s
  - 1622400000 data blocks of 64 bytes took 17.8364 seconds to copy; 5551.7463 MB/s
  - 811200000 data blocks of 128 bytes took 10.8098 seconds to copy; 9160.5659 MB/s
  - 405600000 data blocks of 256 bytes took 5.8616 seconds to copy; 16893.5527 MB/s
  - 202800000 data blocks of 512 bytes took 3.9315 seconds to copy; 25187.2976 MB/s
  - 101400000 data blocks of 1024 bytes took 2.1648 seconds to copy; 45743.4214 MB/s
  - 50700000 data blocks of 2048 bytes took 1.5301 seconds to copy; 64717.0642 MB/s
  - 25350000 data blocks of 4096 bytes took 1.3346 seconds to copy; 74198.4030 MB/s
  - 12675000 data blocks of 8192 bytes took 1.1069 seconds to copy; 89456.2119 MB/s
  - 6337500 data blocks of 16384 bytes took 1.1120 seconds to copy; 89053.2094 MB/s

MOV RAX… memcpy():

  - 1622400000 data blocks of 32 bytes took 7.3536 seconds to copy; 6733.0256 MB/s
  - 1622400000 data blocks of 64 bytes took 10.7727 seconds to copy; 9192.1090 MB/s
  - 811200000 data blocks of 128 bytes took 8.9408 seconds to copy; 11075.4480 MB/s
  - 405600000 data blocks of 256 bytes took 8.4956 seconds to copy; 11655.8805 MB/s
  - 202800000 data blocks of 512 bytes took 9.1032 seconds to copy; 10877.8248 MB/s
  - 101400000 data blocks of 1024 bytes took 8.2539 seconds to copy; 11997.1185 MB/s
  - 50700000 data blocks of 2048 bytes took 7.7909 seconds to copy; 12710.1252 MB/s
  - 25350000 data blocks of 4096 bytes took 7.5992 seconds to copy; 13030.7062 MB/s
  - 12675000 data blocks of 8192 bytes took 7.4679 seconds to copy; 13259.9384 MB/s

So, even on 128-byte blocks, REP MOVSB is slower than just a simple MOV RAX copy in a loop (not unrolled). The ERMSB implementation begins to outperform the MOV RAX loop only starting from 256-byte blocks.
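For clarity, the "MOV RAX" baseline corresponds roughly to the following C loop (my rendering; the exact test harness is not shown):

    #include <stddef.h>
    #include <stdint.h>

    /* Simple qword-at-a-time copy, not unrolled; assumes n is a multiple of 8. */
    static void qword_loop_copy(void *dst, const void *src, size_t n)
    {
        uint64_t *d = dst;
        const uint64_t *s = src;
        for (size_t i = 0; i < n / 8; i++)
            d[i] = s[i];
    }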

Normal (not enhanced) REP MOVS on Nehalem and later

Surprisingly, previous architectures (Nehalem and later), which didn't yet have Enhanced REP MOVSB, had quite fast REP MOVSD/MOVSQ (but not REP MOVSB/MOVSW) implementations for large blocks, as long as the blocks were not large enough to exceed the L1 cache.

The Intel Optimization Manual (2.5.6 REP String Enhancement) gives the following information related to the Nehalem microarchitecture – Intel Core i5, i7 and Xeon processors released in 2009 and 2010.

REP MOVSB

The latency for MOVSB is 9 cycles if ECX < 4. Otherwise, REP MOVSB with ECX > 9 has a 50-cycle startup cost.

  • tiny string (ECX < 4): the latency of REP MOVSB is 9 cycles;
  • small string (ECX is between 4 and 9): no official information in the Intel manual, probably more than 9 cycles but less than 50 cycles;
  • long string (ECX > 9): 50-cycle startup cost.

My conclusion: REP MOVSB is almost useless on Nehalem.

MOVSW/MOVSD/MOVSQ

Quote from the Intel Optimization Manual (2.5.6 REP String Enhancement):

  • Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
  • Fast string (ECX >= 76: excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of the REP string will vary if one of the 16-byte data transfers spans across a cache line boundary:
    = Split-free: the latency consists of a startup cost of about 40 cycles and each 64 bytes of data adds 4 cycles.
    = Cache splits: the latency consists of a startup cost of about 35 cycles and each 64 bytes of data adds 6 cycles.
  • Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.

Intel does not seem to be correct here. From the above quote we understand that for very large memory blocks, REP MOVSW is as fast as REP MOVSD/MOVSQ, but tests have shown that only REP MOVSD/MOVSQ are fast, while REP MOVSW is even slower than REP MOVSB on Nehalem and Westmere.

According to the information provided by Intel in the manual, on previous Intel microarchitectures (before 2008) the startup costs are even higher.

Conclusion: if you just need to copy data that fits L1 cache, just 4 cycles to copy 64 bytes of data is excellent, and you don't need to use XMM registers!

REP MOVSD/MOVSQ is the universal solution that works excellently on all Intel processors (no ERMSB required) if the data fits the L1 cache.
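A minimal sketch of such a REP MOVSQ-based copy (my illustration, not the author's test code): the bulk is moved as qwords with rep movsq and the remaining 0-7 bytes with rep movsb:

    #include <stddef.h>

    static void *movsq_memcpy(void *dst, const void *src, size_t n)
    {
        void *ret = dst;
        size_t qwords = n >> 3;
        size_t tail   = n & 7;
        __asm__ __volatile__ ("rep movsq"
                              : "+D" (dst), "+S" (src), "+c" (qwords)
                              : : "memory");
        __asm__ __volatile__ ("rep movsb"
                              : "+D" (dst), "+S" (src), "+c" (tail)
                              : : "memory");
        return ret;
    }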

Here are tests of REP MOVS* when the source and destination were in the L1 cache, with blocks large enough not to be seriously affected by startup costs, but not so large as to exceed the L1 cache size. Source: http://users.atw.hu/instlatx64/

Yonah (2006-2008)

  REP MOVSB 10.91 B/c
  REP MOVSW 10.85 B/c
  REP MOVSD 11.05 B/c

Nehalem (2009-2010)

  REP MOVSB 25.32 B/c
  REP MOVSW 19.72 B/c
  REP MOVSD 27.56 B/c
  REP MOVSQ 27.54 B/c

Westmere (2010-2011)

  REP MOVSB 21.14 B/c
  REP MOVSW 19.11 B/c
  REP MOVSD 24.27 B/c

Ivy Bridge (2012-2013) – with Enhanced REP MOVSB

  REP MOVSB 28.72 B/c
  REP MOVSW 19.40 B/c
  REP MOVSD 27.96 B/c
  REP MOVSQ 27.89 B/c

SkyLake (2015-2016) – with Enhanced REP MOVSB

  REP MOVSB 57.59 B/c
  REP MOVSW 58.20 B/c
  REP MOVSD 58.10 B/c
  REP MOVSQ 57.59 B/c

Kaby Lake (2016-2017) – with Enhanced REP MOVSB

  REP MOVSB 58.00 B/c
  REP MOVSW 57.69 B/c
  REP MOVSD 58.00 B/c
  REP MOVSQ 57.89 B/c

As you see, the implementation of REP MOVS differs significantly from one microarchitecture to another. On some processors, like Ivy Bridge, REP MOVSB is fastest, albeit just slightly faster than REP MOVSD/MOVSQ, but no doubt that on all processors since Nehalem, REP MOVSD/MOVSQ works very well – you don't even need "Enhanced REP MOVSB", since, on Ivy Bridge (2013) with Enhanced REP MOVSB, REP MOVSD shows the same byte-per-clock data as on Nehalem (2010) without Enhanced REP MOVSB, while in fact REP MOVSB became very fast only since SkyLake (2015) – twice as fast as on Ivy Bridge. So this Enhanced REP MOVSB bit in the CPUID may be confusing – it only shows that REP MOVSB per se is OK, but not that any REP MOVS* is faster.

The most confusing ERMSB implementation is on the Ivy Bridge microarchitecture. Yes, on very old processors, before ERMSB, REP MOVS* for large blocks did use a cache protocol feature that is not available to regular code (no-RFO). But this protocol is no longer used on Ivy Bridge, which has ERMSB. According to Andy Glew's comments on an answer to "why are complicated memcpy/memset superior?" from a Peter Cordes answer, a cache protocol feature that is not available to regular code was once used on older processors, but no longer on Ivy Bridge. And there comes an explanation of why the startup costs are so high for REP MOVS*: „The large overhead for choosing and setting up the right method is mainly due to the lack of microcode branch prediction". There has also been an interesting note that Pentium Pro (P6) in 1996 implemented REP MOVS* with 64-bit microcode loads and stores and a no-RFO cache protocol – they did not violate memory ordering, unlike ERMSB in Ivy Bridge.

Disclaimer

  1. This answer is only relevant for the cases where the source and the destination data fits L1 cache. Depending on circumstances, the particularities of memory access (cache, etc.) should be taken into consideration. Prefetch and NTI may give better results in certain cases, especially on the processors that didn't yet have the Enhanced REP MOVSB. Even on these older processors, REP MOVSD might have used a cache protocol feature that is not available to regular code.
  2. The information in this answer is only related to Intel processors and not to the processors by other manufacturers like AMD that may have better or worse implementations of REP MOVS* instructions.
  3. I have presented test results for both SkyLake and Kaby Lake just for the sake of confirmation – these architectures have the same cycle-per-instruction data.
  4. All product names, trademarks and registered trademarks are property of their respective owners.

There are far more efficient ways to move data. These days, the implementation of memcpy will generate architecture specific code from the compiler that is optimized based upon the memory alignment of the data and other factors. This allows better use of non-temporal cache instructions and XMM and other registers in the x86 world.

Hard-coding rep movsb prevents this use of intrinsics.

Therefore, for something like a memcpy , unless you are writing something that will be tied to a very specific piece of hardware and unless you are going to take the time to write a highly optimized memcpy function in assembly (or using C level intrinsics), you are far better off allowing the compiler to figure it out for you.

As a general memcpy() guide:

a) If the data being copied is tiny (less than maybe 20 bytes) and has a fixed size, let the compiler do it. Reason: Compiler can use normal mov instructions and avoid the startup overheads.

b) If the data being copied is small (less than about 4 KiB) and is guaranteed to be aligned, use rep movsb (if ERMSB is supported) or rep movsd (if ERMSB is not supported). Reason: Using an SSE or AVX alternative has a huge amount of "startup overhead" before it copies anything.

c) If the data being copied is small (less than about 4 KiB) and is not guaranteed to be aligned, use rep movsb . Reason: Using SSE or AVX, or using rep movsd for the bulk of it plus some rep movsb at the start or end, has too much overhead.

d) For all other cases use something like this:

      mov edx,0
  .again:
      pushad
  .nextByte:
      pushad
      popad
      mov al,[esi]
      pushad
      popad
      mov [edi],al
      pushad
      popad
      inc esi
      pushad
      popad
      inc edi
      pushad
      popad
      loop .nextByte
      popad
      inc edx
      cmp edx,1000
      jb .again

Reason: This will be so slow that it will force programmers to find an alternative that doesn't involve copying huge globs of data; and the resulting software will be significantly faster because copying large globs of data was avoided.