Is inline assembly language slower than native C++ code?

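I tried to compare the performance of inline assembly language and C++ code, so I wrote a function that adds two arrays of size 2000, 100000 times. Here's the code: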

    #define TIMES 100000

    void calcuC(int *x,int *y,int length)
    {
        for(int i = 0; i < TIMES; i++)
        {
            for(int j = 0; j < length; j++)
                x[j] += y[j];
        }
    }

    void calcuAsm(int *x,int *y,int lengthOfArray)
    {
        __asm
        {
            mov edi,TIMES
        start:
            mov esi,0
            mov ecx,lengthOfArray
        label:
            mov edx,x
            push edx
            mov eax,DWORD PTR [edx + esi*4]
            mov edx,y
            mov ebx,DWORD PTR [edx + esi*4]
            add eax,ebx
            pop edx
            mov [edx + esi*4],eax
            inc esi
            loop label
            dec edi
            cmp edi,0
            jnz start
        };
    }

Here's main():

    int main() {
        bool errorOccured = false;
        setbuf(stdout,NULL);
        int *xC,*xAsm,*yC,*yAsm;
        xC = new int[2000];
        xAsm = new int[2000];
        yC = new int[2000];
        yAsm = new int[2000];
        for(int i = 0; i < 2000; i++)
        {
            xC[i] = 0;
            xAsm[i] = 0;
            yC[i] = i;
            yAsm[i] = i;
        }
        time_t start = clock();
        calcuC(xC,yC,2000);

        //    calcuAsm(xAsm,yAsm,2000);
        //    for(int i = 0; i < 2000; i++)
        //    {
        //        if(xC[i] != xAsm[i])
        //        {
        //            cout<<"xC["<<i<<"]="<<xC[i]<<" "<<"xAsm["<<i<<"]="<<xAsm[i]<<endl;
        //            errorOccured = true;
        //            break;
        //        }
        //    }
        //    if(errorOccured)
        //        cout<<"Error occurs!"<<endl;
        //    else
        //        cout<<"Works fine!"<<endl;

        time_t end = clock();

        //    cout<<"time = "<<(float)(end - start) / CLOCKS_PER_SEC<<"\n";

        cout<<"time = "<<end - start<<endl;
        return 0;
    }

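Then I ran the program five times to get the processor cycles, which can be read as time. Each time I called only one of the functions mentioned above.

Here are the results.

Function of assembly version: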

    Debug     Release
    -----------------
     732      668
     733      680
     659      672
     667      675
     684      694
    Average:  677

Function of C++ version:

    Debug     Release
    -----------------
    1068      168
     999      166
    1072      231
    1002      166
    1114      183
    Average:  182

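The C++ code in release mode is almost 3.7 times faster than the assembly code. Why?

I guess the assembly code I wrote is not as efficient as the code generated by GCC. It is hard for an ordinary programmer like me to write code that is faster than its compiler-generated counterpart. Does that mean I should not trust the performance of assembly language written by hand, focus on C++, and forget about assembly language?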

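Yes, most of the time.

First of all, you start from the wrong assumption that a low-level language (assembly in this case) will always produce faster code than a high-level language (C++ and C in this case). That is not true. Is C code always faster than Java code? No, because there is another variable: the programmer. The way you write code and your knowledge of architecture details greatly influence performance (as you saw in this case).

You can always produce an example where handmade assembly code is better than compiled code, but usually it is a fictional example or a single routine, not a real program of 500,000+ lines of C++ code. I think compilers will produce better assembly code 95% of the time. Only rarely will you need to write assembly code for a few short, heavily used, performance-critical routines, or when you have to access features your favorite high-level language does not expose. Do you want a taste of this complexity? Read this awesome answer here.

Why is this?

First of all, because compilers can do optimizations that we can't even imagine (see this short list), and they do them in seconds (where we may need days).

When you code in assembly you have to write well-defined functions with a well-defined call interface. Compilers, however, can take into account whole-program optimization and inter-procedural optimization such as register allocation, constant propagation, common subexpression elimination, instruction scheduling and other complex, non-obvious optimizations (the Polytope model, for example). On RISC architectures people stopped worrying about this many years ago (instruction scheduling, for example, is very hard to tune by hand), and modern CISC CPUs have very long pipelines too.

For some complex microcontrollers even the system libraries are written in C instead of assembly, because their compilers produce better (and easier to maintain) final code.

Compilers can sometimes use MMX/SIMDx instructions automatically, and if you don't use them you simply can't compare (other answers have already reviewed your assembly code very well). Just for loops, this is a short list of the loop optimizations a compiler commonly checks for (do you think you could do all of that yourself when your schedule has been set for a C# program?). If you write something in assembly, I think you have to consider at least some simple optimizations. The textbook example for arrays is to unroll the loop (its size is known at compile time). Do it and run your test again.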
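To make the unrolling suggestion concrete, here is a minimal C++ sketch of the inner loop from the question, unrolled by hand. The function names and the unroll factor of 4 are my own choices for illustration, not something taken from the question, and whether this actually beats what the optimizer already does has to be measured.

```cpp
#include <cassert>
#include <cstddef>

// Straightforward version: one addition per loop iteration.
void add_arrays_simple(int* x, const int* y, std::size_t length) {
    assert(x != nullptr && y != nullptr);
    for (std::size_t j = 0; j < length; ++j) {
        x[j] += y[j];
    }
}

// Manually unrolled by a factor of 4 (an arbitrary factor chosen for this sketch).
void add_arrays_unrolled(int* x, const int* y, std::size_t length) {
    assert(x != nullptr && y != nullptr);
    std::size_t j = 0;
    // Main body: four independent additions per iteration, so the loop
    // counter and branch are paid for once per four elements.
    for (; j + 4 <= length; j += 4) {
        x[j]     += y[j];
        x[j + 1] += y[j + 1];
        x[j + 2] += y[j + 2];
        x[j + 3] += y[j + 3];
    }
    // Tail: handle the remaining 0 to 3 elements.
    for (; j < length; ++j) {
        x[j] += y[j];
    }
}
```

Both functions produce the same result for any length; the unrolled one only changes how much loop overhead is spent per element.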

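Nowadays it is also really uncommon to need assembly language for another reason: the plethora of different CPUs. Do you want to support them all? Each has a specific microarchitecture and some specific instruction sets. They have different numbers of functional units, and assembly instructions should be arranged to keep them all busy. If you write in C you can use PGO, but in assembly you need deep knowledge of that specific architecture (and you have to rethink and redo everything for another architecture). For small tasks the compiler usually does better, and for complex tasks the effort usually isn't repaid (and the compiler may do better anyway).

If you sit down and look at your code, you will probably find that you gain more by redesigning your algorithm than by translating it to assembly (read this great post here), and there are high-level optimizations (and hints to the compiler) that you can apply effectively before you need to resort to assembly language. It is probably worth mentioning that by using intrinsics you will often get the performance gain you are looking for, and the compiler will still be able to perform most of its optimizations.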
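As a rough illustration of the intrinsics route, here is a sketch of the array addition written with SSE2 intrinsics from `<emmintrin.h>`. The function name and the `SKETCH_HAVE_SSE2` macro are hypothetical names made up for this example; the code falls back to a plain scalar loop when SSE2 is not detected at compile time, and it uses unaligned loads so it makes no assumption about how the arrays were allocated.

```cpp
#include <cassert>
#include <cstddef>

// SKETCH_HAVE_SSE2 is a made-up macro for this example only.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Adds y[j] to x[j] for every element, four 32-bit integers at a time where possible.
void add_arrays_intrinsics(int* x, const int* y, std::size_t length) {
    assert(x != nullptr && y != nullptr);
    std::size_t j = 0;
#if SKETCH_HAVE_SSE2
    for (; j + 4 <= length; j += 4) {
        // Unaligned loads: no alignment requirement on x or y.
        __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + j));
        __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + j));
        // Four 32-bit additions in one instruction (paddd).
        _mm_storeu_si128(reinterpret_cast<__m128i*>(x + j), _mm_add_epi32(vx, vy));
    }
#endif
    // Scalar tail, or the whole loop when SSE2 is unavailable.
    for (; j < length; ++j) {
        x[j] += y[j];
    }
}
```

Unlike an inline assembly block, the compiler still does the register allocation and scheduling here and can inline the function into its caller.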

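All this said, even when you can produce assembly code that is 5 to 10 times faster, you should ask your customers whether they would rather pay for a week of your time or buy a CPU that is 50$ faster. More often than not (and especially in LOB applications), extreme optimization is simply not required from most of us.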

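Your assembly code is suboptimal and could be improved:

You push and pop a register (EDX) in your inner loop. This should be moved out of the loop.
You reload the array pointers in every iteration of the loop. This should be moved out of the loop.
You use the loop instruction, which is known to be dead slow on most modern CPUs (possibly a result of using an ancient assembly book*).
You take no advantage of manual loop unrolling.
You don't use the available SIMD instructions.

So unless you vastly improve your assembler skills, it doesn't make sense for you to write assembly code for performance.

*Of course I don't know whether you really got the loop instruction from an ancient assembly book. But you almost never see it in real-world code, since every compiler out there is smart enough not to emit loop; you only see it in what are, IMHO, bad and outdated books.

Even before delving into assembly, there are code transformations that exist at a higher level.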

    static int const TIMES = 100000;

    void calcuC(int *x, int *y, int length) {
        for (int i = 0; i < TIMES; i++) {
            for (int j = 0; j < length; j++) {
                x[j] += y[j];
            }
        }
    }

can be transformed, by interchanging the two loops, into:

    static int const TIMES = 100000;

    void calcuC(int *x, int *y, int length) {
        for (int j = 0; j < length; ++j) {
            for (int i = 0; i < TIMES; ++i) {
                x[j] += y[j];
            }
        }
    }

which is much better as far as memory locality goes.

This can be optimized further: doing a += b X times is equivalent to doing a += X * b, so we get:

    static int const TIMES = 100000;

    void calcuC(int *x, int *y, int length) {
        for (int j = 0; j < length; ++j) {
            x[j] += TIMES * y[j];
        }
    }

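However, it seems my favorite optimizer (LLVM) does not perform this transformation.

[edit] I found that the transformation is performed if x and y carry the restrict qualifier. Indeed, without this restriction x[j] and y[j] could alias the same location, which makes the transformation erroneous. [end edit]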
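The aliasing problem is easy to reproduce. The sketch below is my own illustration, not code from the thread: it uses unsigned integers and a repeat count of 3 instead of 100000 so the numbers stay small, and it compares the repeated-add form with the folded form. When x and y are distinct arrays both give the same answer; when the same array is passed for both, they diverge, which is why the compiler may not fold the loop without a no-alias guarantee.

```cpp
#include <cassert>

// Small repeat count chosen for this sketch (the question uses 100000).
constexpr unsigned kTimes = 3;

// Original form: add y[j] to x[j], kTimes passes over the array.
void repeated_add(unsigned* x, const unsigned* y, int length) {
    assert(length >= 0);
    for (unsigned i = 0; i < kTimes; ++i) {
        for (int j = 0; j < length; ++j) {
            x[j] += y[j];
        }
    }
}

// Folded form: a single pass using x[j] += kTimes * y[j].
void folded_add(unsigned* x, const unsigned* y, int length) {
    assert(length >= 0);
    for (int j = 0; j < length; ++j) {
        x[j] += kTimes * y[j];
    }
}
```

With x = {1} and y = {5} both functions leave x[0] == 16. With a single array a = {5} passed as both arguments, repeated_add doubles the value on every pass (5, 10, 20, 40), while folded_add computes 5 + 3 * 5 = 20.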

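Anyway, this is, I think, the optimized C version. It is already much simpler. Based on it, here is my crack at the ASM (I let Clang generate it; I am useless at it):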

    calcuAsm:                               # @calcuAsm
    .Ltmp0:
        .cfi_startproc
    # BB#0:
        testl   %edx, %edx
        jle     .LBB0_2
        .align  16, 0x90
    .LBB0_1:                                # %.lr.ph
                                            # =>This Inner Loop Header: Depth=1
        imull   $100000, (%rsi), %eax       # imm = 0x186A0
        addl    %eax, (%rdi)
        addq    $4, %rsi
        addq    $4, %rdi
        decl    %edx
        jne     .LBB0_1
    .LBB0_2:                                # %._crit_edge
        ret
    .Ltmp1:
        .size   calcuAsm, .Ltmp1-calcuAsm
    .Ltmp2:
        .cfi_endproc

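I am afraid I don't understand where all those instructions come from, but you can always have fun and see how it compares... Still, in real code I would use the optimized C version rather than the assembly one: it is much more portable.

Short answer: yes.

Long answer: yes, unless you really know what you're doing and have a reason to do so.

I have fixed my asm code: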

    __asm
    {
        mov ebx,TIMES
    start:
        mov ecx,lengthOfArray
        mov esi,x
        shr ecx,1
        mov edi,y
    label:
        movq mm0,QWORD PTR[esi]
        paddd mm0,QWORD PTR[edi]
        add edi,8
        movq QWORD PTR[esi],mm0
        add esi,8
        dec ecx
        jnz label
        dec ebx
        jnz start
    };

Results for the Release version:

    Function of assembly version: 81
    Function of C++ version: 161

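The assembly code in release mode is almost 2 times faster than the C++.

"Does that mean I should not trust the performance of assembly language written by my hands?"

Yes, that is exactly what it means, and it is true for every language. If you don't know how to write efficient code in language X, then you should not trust your ability to write efficient code in X. So if you want efficient code, you should use another language.

Assembly is particularly sensitive to this because, well, what you see is what you get. You write the specific instructions that you want the CPU to execute. With high-level languages there is a compiler in between, which can transform your code and remove many inefficiencies. With assembly, you're on your own.

The only reason to use assembly language nowadays is to use some features that are not accessible from the language.

This applies to:

Kernel programming that needs access to certain hardware features, such as the MMU.
High-performance programming that uses very specific vector or multimedia instructions not supported by your compiler.

But current compilers are pretty smart. They can even replace two separate statements like d = a / b; r = a % b; with a single instruction that calculates the quotient and the remainder in one go, if one is available, even though C has no such operator.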
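Here is what that pattern looks like in source form. This is only a sketch with names I made up (DivResult, divide). On x86, for example, the idiv instruction leaves the quotient and the remainder in two registers, so an optimizing compiler typically needs only one division for the two statements; whether your compiler actually does so is something to confirm in the generated assembly.

```cpp
#include <cassert>

// Hypothetical helper type for this example.
struct DivResult {
    int quotient;
    int remainder;
};

// Two separate statements in the source; the optimizer is free to serve
// both from a single hardware division when the target provides one.
DivResult divide(int a, int b) {
    assert(b != 0);
    int d = a / b;
    int r = a % b;
    return DivResult{d, r};
}
```

For instance, divide(17, 5) yields quotient 3 and remainder 2. Since C++11, integer division truncates toward zero, so divide(-7, 2) yields quotient -3 and remainder -1.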

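It's true that a modern compiler does an amazing job at code optimization, yet I would still encourage you to keep learning assembly.

First of all, you are clearly not intimidated by it, and that's a great, great plus. Next, you are on the right track by profiling in order to validate or discard your speed assumptions, you are asking for input from experienced people, and you have the greatest optimizing tool known to mankind: a brain.

As your experience increases, you'll learn when and where to use it (usually the tightest, innermost loops in your code, after you have deeply optimized at the algorithmic level).

For inspiration, I would recommend you look up Michael Abrash's articles (if you haven't heard of him, he is an optimization guru; he even collaborated with John Carmack on the optimization of the Quake software renderer).

"There ain't no such thing as the fastest code" - Michael Abrash

I have changed the asm code: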

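    __asm
    {
        mov ebx,TIMES
    start:
        mov ecx,lengthOfArray
        mov esi,x
        shr ecx,2                  ; inner loop count = lengthOfArray / 4
        mov edi,y
    label:
        mov eax,DWORD PTR [esi]    ; load one 32-bit element of x
        add eax,DWORD PTR [edi]    ; add the matching element of y
        add edi,4
        dec ecx
        mov DWORD PTR [esi],eax    ; store one 32-bit element per iteration
        add esi,4
        test ecx,ecx
        jnz label
        dec ebx
        test ebx,ebx
        jnz start
    };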

Results for the Release version:

    Function of assembly version: 41
    Function of C++ version: 161

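The assembly code in release mode is almost 4 times faster than the C++. IMHO, the speed of assembly code depends on the programmer.

Most high-level language compilers are very well optimized and know what they are doing. You can try dumping the disassembled code and comparing it with your own assembly. I believe you will see some nice tricks that your compiler is using.

Just as an example, even though I'm not sure it is still true :) :

Doing: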

 mov eax,0

costs more cycles than

 xor eax,eax

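which does the same thing.

The compiler knows all these tricks and uses them.

The compiler beat you. I'll give it a try, but I won't make any guarantees. I will assume that the "multiplication" by TIMES is only there to make this a more relevant performance test, that y and x are 16-byte aligned, and that length is a non-zero multiple of 4. That is probably all true anyway.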

        mov ecx,length
        lea esi,[y+4*ecx]
        lea edi,[x+4*ecx]
        neg ecx
    loop:
        movdqa xmm0,[esi+4*ecx]
        paddd xmm0,[edi+4*ecx]
        movdqa [edi+4*ecx],xmm0
        add ecx,4
        jnz loop

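Like I said, I make no guarantees. But I would be surprised if it can be done much faster: the bottleneck here is memory throughput, even if everything is an L1 hit.

This is a very interesting topic!
I replaced the MMX in Sasha's code with SSE.
Here are my results: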

    Function of C++ version:        315
    Function of assembly (simply):  312
    Function of assembly (MMX):     136
    Function of assembly (SSE):      62

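The assembly code with SSE is 5 times faster than the C++.

Just blindly implementing the exact same algorithm, instruction by instruction, in assembly is guaranteed to be slower than what the compiler can do.

That is because even the smallest optimization the compiler performs is better than your rigid code with no optimization at all.

Of course it is possible to beat the compiler, especially for a small, localized part of the code. I even had to do it myself to get an approx. 4x speedup, but in that case you have to rely heavily on good knowledge of the hardware and on numerous, seemingly counter-intuitive tricks.

That's exactly what it means. Leave the micro-optimizations to the compiler.

I love this example because it demonstrates an important lesson about low-level code. Yes, you can write assembly that is as fast as your C code. This is tautologically true, but it doesn't necessarily mean anything. Clearly somebody can, otherwise the assembler wouldn't know the appropriate optimizations.

Likewise, the same principle applies as you go up the hierarchy of language abstraction. Yes, you can write a parser in C that is as fast as a quick and dirty perl script, and many people do. But that doesn't mean that because you used C, your code will be fast. In many cases, the higher-level languages perform optimizations that you may never even have considered.

As a compiler, I would replace a loop with a fixed number of iterations by a sequence of straight-line statements.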

    int a = 10;
    for (int i = 0; i < 3; i += 1) {
        a = a + i;
    }

will produce

    int a = 10;
    a = a + 0;
    a = a + 1;
    a = a + 2;

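and eventually it will figure out that "a = a + 0;" is useless, so it will remove that line. Hopefully some further optimizations come to mind now that you would like to add as a comment. All of these very effective optimizations make the compiled language faster.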
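Taken to its conclusion, the whole loop collapses to the constant 13. The sketch below, with function names of my own choosing, shows that C++14 can even express this directly: a constexpr function lets the compiler evaluate the loop at compile time, and static_assert checks the result during compilation. This illustrates that the value is computable at compile time; it is not a statement about what any particular optimizer does with an ordinary loop.

```cpp
#include <cassert>

// Requires C++14 (loops inside constexpr functions).
constexpr int accumulate_fixed_loop() {
    int a = 10;
    for (int i = 0; i < 3; i += 1) {
        a = a + i;
    }
    return a;
}

// Checked by the compiler: 10 + 0 + 1 + 2 == 13, no code runs for this.
static_assert(accumulate_fixed_loop() == 13, "loop folds to a constant");

// The same loop with run-time inputs, for comparison. Here the compiler
// can only fold the result if it can see the argument values.
int accumulate_runtime(int a, int count) {
    assert(count >= 0);
    for (int i = 0; i < count; i += 1) {
        a = a + i;
    }
    return a;
}
```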

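In many cases, the optimal way to perform some task may depend on the context in which the task is performed. If a routine is written in assembly language, it will generally not be possible for the sequence of instructions to vary with the context. As a simple example, consider the following simple method: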

    inline void set_port_high(void)
    {
        (*((volatile unsigned char*)0x40001204) = 0xFF);
    }

A compiler for 32-bit ARM code, given the above, would likely render it as something like:

    ldr  r0,=0x40001204
    mov  r1,#0xFF
    strb r1,[r0]
    [a fourth word somewhere holding the constant 0x40001204]

or perhaps

    ldr  r0,=0x40001000  ; Some assemblers like to round pointer loads to multiples of 4096
    mov  r1,#0xFF
    strb r1,[r0+0x204]
    [a fourth word somewhere holding the constant 0x40001000]

That could be optimized slightly in hand-assembled code, as either:

    ldr  r0,=0x400011FF
    strb r0,[r0+5]
    [a third word somewhere holding the constant 0x400011FF]

or

    mvn  r0,#0xC0000000  ; Load with 0x3FFFFFFF
    add  r0,r0,#0x1200   ; Add 0x1200, yielding 0x400011FF
    strb r0,[r0+5]

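Both of the hand-assembled approaches would require 12 bytes of code space rather than 16; the latter would replace a "load" with an "add", which on an ARM7-TDMI would execute two cycles faster. If the code were going to be executed in a context where the contents of r0 were don't-know/don't-care, the assembly language versions would thus be somewhat better than the compiled version. On the other hand, suppose the compiler knew that some register [e.g. r5] was going to hold a value within 2047 bytes of the desired address 0x40001204 [e.g. 0x40001000], and further knew that some other register [e.g. r7] was going to hold a value whose low bits were 0xFF. In that case, a compiler could optimize the C version of the code to simply: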

 strb r7,[r5+0x204]

Much shorter and faster than even the hand-optimized assembly code. Further, suppose set_port_high occurred in the context:

    int temp = function1();
    set_port_high();
    function2(temp); // Assume temp is not used after this

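That is not at all implausible when coding for an embedded system. If set_port_high is written in assembly code, the compiler would have to move r0 (which holds the return value from function1) somewhere else before invoking the assembly code, and then move that value back to r0 afterward (since function2 will expect its first parameter in r0), so the "optimized" assembly code would need five instructions. Even if the compiler didn't know of any registers holding the address or the value to store, its four-instruction version (which it could adapt to use any available registers, not necessarily r0 and r1) would beat the "optimized" assembly-language version. If the compiler had the necessary address and data in r5 and r7 as described earlier, function1 would not alter those registers, and thus it could replace set_port_high with a single strb instruction, four instructions smaller and faster than the "hand-optimized" assembly code.

Note that hand-optimized assembly code can often outperform a compiler in cases where the programmer knows the precise program flow, but compilers shine in cases where a piece of code is written before its context is known, or where one piece of source code may be invoked from multiple contexts [if set_port_high is used in fifty different places in the code, the compiler could independently decide for each of those how best to expand it].

In general, I would suggest that assembly language is apt to yield the greatest performance improvements in those cases where each piece of code can be approached from a very limited number of contexts, and is apt to be detrimental to performance in places where a piece of code may be approached from many different contexts. Interestingly (and conveniently) the cases where assembly is most beneficial to performance are often those where the code is most straightforward and easy to read. The places that assembly language code would turn into a gooey mess are often those where writing in assembly would offer the smallest performance benefit.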

[Minor note: there are some places where assembly code can be used to yield a hyper-optimized gooey mess; for example, one piece of code I did for the ARM needed to fetch a word from RAM and execute one of about twelve routines based upon the upper six bits of the value (many values mapped to the same routine). I think I optimized that code to something like:

    ldrh  r0,[r1],#2!  ; Fetch with post-increment
    ldrb  r1,[r8,r0 asr #10]
    sub   pc,r8,r1,asl #2

The register r8 always held the address of the main dispatch table (within the loop where the code spent 98% of its time, nothing ever used it for any other purpose); all 64 entries referred to addresses in the 256 bytes preceding it. Since the primary loop had in most cases a hard execution-time limit of about 60 cycles, the nine-cycle fetch and dispatch was very instrumental toward meeting that goal. Using a table of 256 32-bit addresses would have been one cycle faster, but would have gobbled up 1KB of very precious RAM [flash would have added more than one wait state]. Using 64 32-bit addresses would have required adding an instruction to mask off some bits from the fetched word, and would still have gobbled up 192 more bytes than the table I actually used. Using the table of 8-bit offsets yielded very compact and fast code, but not something I would expect a compiler would ever come up with; I also would not expect a compiler to dedicate a register "full time" to holding the table address.

The above code was designed to run as a self-contained system; it could periodically call C code, but only at certain times when the hardware with which it was communicating could safely be put into an "idle" state for two roughly-one-millisecond intervals every 16ms.]
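For readers who want the idea of that compact dispatch table without the ARM details, here is a rough C++ analogue. It is not the original code: the three handlers, the table contents and all names are hypothetical, and it stores small indices into a handler array rather than byte offsets relative to a dedicated register. What it keeps is the space trade-off, namely 64 one-byte entries selected by the upper six bits of a 16-bit word instead of 64 full-width addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using Handler = int (*)(std::uint16_t);

// Hypothetical handlers standing in for the "about twelve routines".
int handle_low(std::uint16_t)  { return 0; }
int handle_mid(std::uint16_t)  { return 1; }
int handle_high(std::uint16_t) { return 2; }

const Handler kHandlers[] = { handle_low, handle_mid, handle_high };

// One byte per possible value of the upper six bits: 64 bytes in total,
// versus 64 full pointers for a direct table of addresses.
struct DispatchTable {
    std::uint8_t index[64];
};

// Made-up mapping: many table slots share the same handler.
DispatchTable make_table() {
    DispatchTable t{};
    for (int i = 0; i < 64; ++i) {
        t.index[i] = static_cast<std::uint8_t>(i < 16 ? 0 : (i < 48 ? 1 : 2));
    }
    return t;
}

int dispatch(const DispatchTable& t, std::uint16_t word) {
    // The upper six bits of the 16-bit word select the table entry.
    std::uint8_t slot = t.index[word >> 10];
    assert(slot < sizeof(kHandlers) / sizeof(kHandlers[0]));
    return kHandlers[slot](word);
}
```

A compiler will turn this into a byte load followed by an indirect call, which costs more than the three-instruction sequence above, but the memory layout follows the same reasoning.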

In recent times, all the speed optimisations that I have done were replacing brain-damaged slow code with just reasonable code. But for things where speed was really critical and I put serious effort into making something fast, the result was always an iterative process, where each iteration gave more insight into the problem, finding ways to solve the problem with fewer operations. The final speed always depended on how much insight I got into the problem. If at any stage I had used assembly code, or C code that was over-optimised, the process of finding a better solution would have suffered and the end result would have been slower.

All the answers here seem to exclude one aspect: sometimes we don't write code to achieve a specific aim, but for the sheer fun of it. It may not be economical to invest the time to do so, but arguably there is no greater satisfaction than beating the fastest compiler optimized code snippet in speed with a manually rolled asm alternative.

C++ is faster unless you use assembly language with deep knowledge and in the correct way.

When I code in ASM, I reorganize the instructions manually so the CPU can execute more of them in parallel when logically possible. I barely use RAM when I code in ASM. For example, there could be 20000+ lines of code in ASM and I never once used push/pop.

You could potentially jump into the middle of an opcode to self-modify the code and the behavior without the possible penalty of self-modifying code. Accessing registers takes 1 tick (sometimes .25 ticks) of the CPU. Accessing the RAM could take hundreds.

For my last ASM adventure, I never once used the RAM to store a variable (for thousands of lines of ASM). ASM could potentially be unimaginably faster than C++. But it depends on a lot of variable factors such as:

 1. I was writing my apps to run on the bare metal.
 2. I was writing my own boot loader that was starting my programs in ASM, so there was no OS management in the middle.

I am now learning C# and C++ because I realized productivity matters!! You could try to write the fastest imaginable programs using pure ASM alone in your free time. But in order to produce something, use some high-level language.

For example, the last program I coded was using JS and GLSL and I never noticed any performance issue, even speaking about JS which is slow. This is because the mere concept of programming the GPU for 3D makes the speed of the language that sends the commands to the GPU almost irrelevant.

The speed of assembler alone on the bare metal is irrefutable. Could it be even slower inside C++? It could be, because you are writing assembly code with a compiler rather than with an assembler to start with.

My personal counsel is to never write assembly code if you can avoid it, even though I love assembly.

Assembly could be faster if your compiler generates a lot of OO support code.

Edit:

To downvoters: the OP wrote "should I … focus on C++ and forget about assembly language?" and I stand by my answer. You always need to keep an eye on the code OO generates, particularly when using methods. Not forgetting about assembly language means that you will periodically review the assembly your OO code generates which I believe is a must for writing well-performing software.

Actually, this pertains to all compilable code, not just OO.
