When is assembly faster than C?

One of the stated reasons for knowing assembler is that, on occasion, it can be used to write code that will be more performant than writing that code in a higher-level language, C in particular. However, I have also heard it stated many times that although that's not entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembly.

This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.

Can anyone provide some specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric these cases are, since it seems to be a point of some contention.

Here is a real-world example: fixed-point multiplication.

These functions don't only come in handy on devices without floating point, they also shine when it comes to precision, as they give you 32 bits of precision with a predictable error (float only has 23 bits and it's harder to predict precision loss).

One way of writing a fixed-point multiplication on a 32-bit architecture looks like this:

 int inline FixedPointMul (int a, int b)
 {
     long long a_long = a;              // cast to 64 bit.
     long long product = a_long * b;    // perform multiplication
     return (int) (product >> 16);      // shift by the fixed point bias
 }

The problem with this code is that we do something that can't be directly expressed in the C language. We want to multiply two 32-bit numbers and get a 64-bit result, of which we return the middle 32 bits. However, in C this multiplication does not exist. All you can do is promote the integers to 64 bits and do a 64*64 = 64 multiplication.

x86 (and ARM, MIPS, etc.) can however do the multiplication in a single instruction. Many compilers still ignore this fact and generate code that calls a runtime library function to do the multiplication. The shift by 16 is also often done by a library routine (the x86 can do such shifts as well).

So we're left with one or two library calls just for a multiplication. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls, and it doesn't help inlining or code unrolling either.

If you rewrite the same code in assembler, you can get a significant speed boost.
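As a hedged illustration of what such a rewrite might look like (my own sketch using GCC-style inline assembly on 32-bit x86, not the answer's code), the single imul instruction produces the full 64-bit product in edx:eax, and we pick out the middle 32 bits without any library call:

    /* Sketch only, assuming a 32-bit x86 target and a GCC-compatible compiler. */
    static inline int FixedPointMulAsm (int a, int b)
    {
        int hi, lo;
        __asm__ ("imull %3"                    /* edx:eax = eax * b */
                 : "=d" (hi), "=a" (lo)
                 : "1" (a), "rm" (b)
                 : "cc");
        return (int)(((unsigned)lo >> 16) | ((unsigned)hi << 16)); /* middle 32 bits */
    }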

An aside: using ASM isn't the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET 2008 compiler, for example, exposes the 32*32 = 64 bit mul as __emul and the 64-bit shift as __ll_rshift.

Using intrinsics you can rewrite the function in a way that gives the C compiler a chance to understand what's going on. This allows the code to be inlined and register allocated, and common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over hand-written assembler code that way.

For reference: the end result for the fixed-point mul with the VS.NET compiler is:

 int inline FixedPointMul (int a, int b)
 {
     return (int) __ll_rshift(__emul(a,b),16);
 }

By the way – the performance difference is even bigger for fixed-point division. I got improvements of up to a factor of 10 for division-heavy fixed-point code by writing a couple of asm lines.

Edit:

Using Visual C++ 2013 gives the same assembly code for both ways.

Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides etc. I showed him how to recast the problem using bit shifts, and the processing time came down to about 30 seconds on the non-optimizing compiler he had. I had just got an optimizing compiler, and the same code rotated the graphic in under 5 seconds. I looked at the assembly code that the compiler was generating, and from what I saw decided there and then that my days of writing assembler were over.
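For what it's worth, a tiny sketch of the kind of recasting described above (my own example, not the original exercise): replacing a multiply by a power of two with a shift when computing, say, a row offset.

    int offset_mul (int y, int x) { return y * 64 + x; }   /* multiply version  */
    int offset_shl (int y, int x) { return (y << 6) + x; } /* bit-shift version */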

The compilers I was using did pretty much the same whenever they saw floating-point code: the hand-written version would be quicker. The main reason is that the compiler can't perform any robust optimizations. See this article on MSDN for a discussion of the subject. Here is an example where the assembly version is twice the speed of the C version (compiled with VS2K5):

 #include "stdafx.h" #include <windows.h> float KahanSum ( const float *data, int n ) { float sum = 0.0f, C = 0.0f, Y, T; for (int i = 0 ; i < n ; ++i) { Y = *data++ - C; T = sum + Y; C = T - sum - Y; sum = T; } return sum; } float AsmSum ( const float *data, int n ) { float result = 0.0f; _asm { mov esi,data mov ecx,n fldz fldz l1: fsubr [esi] add esi,4 fld st(0) fadd st(0),st(2) fld st(0) fsub st(0),st(3) fsub st(0),st(2) fstp st(2) fstp st(2) loop l1 fstp result fstp result } return result; } int main (int, char **) { int count = 1000000; float *source = new float [count]; for (int i = 0 ; i < count ; ++i) { source [i] = static_cast <float> (rand ()) / static_cast <float> (RAND_MAX); } LARGE_INTEGER start, mid, end; float sum1 = 0.0f, sum2 = 0.0f; QueryPerformanceCounter (&start); sum1 = KahanSum (source, count); QueryPerformanceCounter (&mid); sum2 = AsmSum (source, count); QueryPerformanceCounter (&end); cout << " C code: " << sum1 << " in " << (mid.QuadPart - start.QuadPart) << endl; cout << "asm code: " << sum2 << " in " << (end.QuadPart - mid.QuadPart) << endl; return 0; } 

And some numbers from my PC running a default release build*:

    C code: 500137 in 103884668
  asm code: 500137 in 52129147

Out of interest, I swapped the loop with a dec/jnz and it made no real difference to the timings – sometimes faster, sometimes slower. I guess the memory-limited aspect dwarfs other optimisations.

Whoops, I was running a slightly different version of the code and it output the numbers the wrong way round (i.e. C was faster!). Fixed and updated the results.

Without giving any specific example or profiler evidence: you can write better assembler than the compiler when you know more than the compiler.

In the general case, a modern C compiler knows much more about how to optimize the code in question: it knows how the processor pipeline works, it can try to reorder instructions quicker than a human can, and so on – it's basically the same as a computer being as good as or better than the best human player at board games, simply because it can search the problem space faster than most humans. Although you theoretically could perform as well as the computer in a specific case, you certainly can't do it at the same speed, making it infeasible for more than a few cases (i.e. the compiler will most certainly outperform you if you try to write more than a few routines in assembler).

On the other hand, there are cases where the compiler does not have as much information – I'd say primarily when working with different forms of external hardware, of which the compiler has no knowledge. The primary example is probably device drivers, where assembler combined with a human's intimate knowledge of the hardware in question can yield better results than a C compiler could.

Others have mentioned special-purpose instructions, which is what I'm talking about in the paragraph above – instructions of which the compiler might have limited or no knowledge at all, making it possible for a human to write faster code.

In my job, there are three reasons for me to know and use assembly. In order of importance:

  1. Debugging – I often get library code that has bugs or inadequate documentation. I figure out what it's doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#. Looking at the assembly gets past that.

  2. Optimizing – the compiler does fairly well at optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code that looks like this:

     for (int y=0; y < imageHeight; y++) {
         for (int x=0; x < imageWidth; x++) {
            // do something
         }
     }

    “做某件事”通常发生在几百万次(即3到30次)之间。 通过在“做某事”阶段中刮取周期,性能收益被极大地放大。 我通常不会从那里开始 – 通常我会先编写代码开始工作,然后尽我所能重构C,使其自然更好(更好的algorithm,更less的循环负载等)。 我通常需要阅读程序集来查看正在发生的事情,而且很less需要编写它。 我可能每两三个月做一次。

  3. Doing things the language won't let me. These include getting at the processor architecture and specific processor features, accessing flags not in the CPU (man, I really wish C gave you access to the carry flag), etc. I do this maybe once a year or two.

Only when using certain special-purpose instruction sets the compiler doesn't support.

To maximize the computing power of a modern CPU with multiple pipelines and predictive branching, you need to structure the assembly program in a way that makes it a) almost impossible for a human to write, and b) even more impossible to maintain.

Also, better algorithms, data structures and memory management will give you more performance than any micro-optimization you can do in assembly.

Although C is "close" to the low-level manipulation of 8-, 16-, 32-, and 64-bit data, there are a few mathematical operations that C doesn't support which can be performed elegantly in certain assembly instruction sets:

  1. Fixed-point multiplication: the product of two 16-bit numbers is a 32-bit number. But the rules in C say that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number – the bottom half in both cases. If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back:

     int16_t x, y;
     // int16_t is a typedef for "short"
     // set x and y to something
     int16_t prod = (int16_t)(((int32_t)x*y)>>16);

    In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16 multiply. Or it may be stupid and require a library call to do the 32x32 multiply, which is way overkill because you only need 16 bits of the product – but the C standard doesn't give you any way to express yourself.

  2. Certain bit-shifting operations (rotation/carries):

     // 256-bit array shifted right in its entirety:
     uint8_t x[32];
     for (int i = 32; --i > 0; )
     {
        x[i] = (x[i] >> 1) | (x[i-1] << 7);
     }
     x[0] >>= 1;

    This isn't too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets allow you to rotate or shift left/right with the result in the carry register, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit shifts, using auto-increment on the pointer.

    For another example, there are linear feedback shift registers (LFSR) that are elegantly performed in assembly: take a chunk of N bits (8, 16, 32, 64, 128, etc.), shift the whole thing right by 1 (see the algorithm above), then if the resulting carry is 1 you XOR in a bit pattern that represents the polynomial.
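    As a hedged illustration (my own C sketch, not from the answer; the polynomial mask is just an example value), one step of such an LFSR done the "C way" looks like this:

     #include <stdint.h>

     #define LFSR_POLY 0xB400u            /* example 16-bit polynomial mask (assumption) */

     uint16_t lfsr_step (uint16_t state)
     {
         uint16_t carry = state & 1u;     /* the bit that is shifted out          */
         state >>= 1;                     /* shift the whole register right by 1  */
         if (carry)
             state ^= LFSR_POLY;          /* XOR in the polynomial if carry was 1 */
         return state;
     }

    In assembly the shift and the carry test collapse into the carry flag, which C has no way to name.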

Having said that, I wouldn't resort to these techniques unless I had serious performance constraints. As others have said, assembly is much harder to document/debug/test/maintain than C code: the performance gain comes with some serious costs.

Edit: 3. Overflow detection is possible in assembly (you can't really do it in C), which makes some algorithms much easier.

Short answer? Sometimes.

Technically every abstraction has a cost, and a programming language is an abstraction for how the CPU works. C however is very close. Years ago, I remember laughing out loud when I logged onto my UNIX account and got the following fortune message (back when such things were popular):

The C Programming Language – A language which combines the flexibility of assembly language with the power of assembly language.

It's funny because it's true: C is like portable assembly language.

It's worth noting that assembly language just runs however you write it. There is however a compiler in between C and the assembly language it generates, and that is extremely important, because how fast your C code is has an awful lot to do with how good your compiler is.

When gcc came on the scene, one of the things that made it so popular was that it was often so much better than the C compilers that shipped with many commercial UNIX flavours. Not only was it ANSI C (none of this K&R C rubbish), it was more robust and typically produced better (faster) code. Not always, but often.

I tell you all this because there is no blanket rule about the speed of C versus assembler, because there is no objective standard for C.

Likewise, assembler varies a lot depending on what processor you're running, your system spec, what instruction set you're using, and so on. Historically there have been two families of CPU architecture: CISC and RISC. The biggest player in CISC was and still is the Intel x86 architecture (and instruction set). RISC dominated the UNIX world (MIPS6000, Alpha, Sparc and so forth). CISC won the battle for hearts and minds.

Anyway, the popular wisdom when I was a younger developer was that hand-written x86 could often be much faster than C, because of the way the architecture worked: it had a complexity that benefited from a human doing it. RISC, on the other hand, seemed designed for compilers, so nobody (I knew) wrote Sparc assembler. I'm sure such people existed, but no doubt they've both gone insane and been institutionalized by now.

Instruction sets are an important point even within the same family of processors. Certain Intel processors have extensions like SSE through SSE4. AMD had their own SIMD instructions. The benefit of a programming language like C is that somebody can write their library so it's optimized for whichever processor you're running on. That is hard work in assembler.

There are still optimizations you can make in assembler that no compiler could make, and a well-written assembler algorithm will be as fast as or faster than its C equivalent. The bigger question is: is it worth it?

Ultimately, though, assembler was a product of its time and was more popular when CPU cycles were expensive. Nowadays a CPU that costs $5-10 to manufacture (Intel Atom) can do pretty much anything anyone could want. The only real reason to write assembler these days is for low-level things like some parts of an operating system (even though the vast majority of the Linux kernel is written in C), device drivers, possibly embedded devices (although C tends to dominate there too), and so on. Or just for kicks (which is somewhat masochistic).

A point which is not an answer: even if you never program in it, I find it useful to know at least one assembler instruction set. This is part of the programmer's never-ending quest to know more and therefore be better. It is also useful when stepping into frameworks you don't have the source code for, to have at least a rough idea of what is going on. It also helps you understand Java ByteCode and .Net IL, as they are both similar to assembler.

To answer the question: when you have a small amount of code or a large amount of time. Most useful on embedded chips, where the low complexity of these chips and the poor competition among compilers targeting them can tip the balance in favour of humans. Also, for restricted devices you are often trading off code size/memory size/performance in a way that would be hard to instruct a compiler to do. E.g., "I know this user action is not called often, so I will have small code size and poor performance, but this other function that looks similar is used every second, so I will have a larger code size and faster performance." That is the kind of trade-off a skilled assembly programmer can make.

I would also like to add that there is a lot of middle ground, where you can code in C, compile and examine the assembly produced, then either change your C code, or tweak the result and maintain it as assembly.

My friend works on microcontrollers, currently chips for controlling small electric motors. He works in a combination of low-level C and assembly. He once told me about a good day at work where he reduced the main loop from 48 instructions down to 43. He also faces choices like when the code has grown to fill the 256K chip and the business wants a new feature; do you

  1. Remove an existing feature
  2. Reduce the size of some or all of the existing features, possibly at the cost of performance.
  3. Advocate moving to a larger chip with a higher cost, higher power consumption and a larger form factor.

I would like to add that, as a commercial developer across quite a portfolio of languages, platforms and application types, I have never once felt the need to dive into writing assembly. I have, however, always appreciated the knowledge I gained about it. And I have sometimes debugged into it.

I know I have answered the question "why should I learn assembler" far more than the actual one, but I feel it is a more important question than "when is it faster".

So let's try once more: you should be thinking about assembly

  • when working on low-level OS functions
  • when working on a compiler.
  • when working on an extremely limited chip, embedded system etc.

Remember to compare your assembly against what the compiler generates, to see which is faster/smaller/better.

David.

A use case which might not apply anymore, but for your nerdy pleasure: on the Amiga, the CPU and the graphics/audio chips would fight for access to a certain area of RAM (the first 2MB of RAM, to be specific). So when you had only 2MB of RAM (or less), displaying complex graphics plus playing sound would kill the performance of the CPU.

In assembler, you could interleave your code in such a clever way that the CPU would only try to access the RAM when the graphics/audio chips were busy internally (i.e. when the bus was free). So by reordering your instructions and making clever use of the CPU cache and the bus timing, you could achieve effects that were simply not possible using any higher-level language, because you had to time every single command, even inserting NOPs here and there to keep the various chips out of each other's way.

Which is another reason why the NOP (No Operation – do nothing) instruction of the CPU can actually make your whole application run faster.

[Edit] Of course, the technique depends on a specific hardware setup. Which was the main reason why many Amiga games couldn't cope with faster CPUs: the timing of the instructions was off.

Matrix operations using SIMD instructions are probably faster than compiler-generated code.
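As a hedged illustration of the kind of SIMD code this refers to (my own sketch with SSE intrinsics, not taken from the thread; it assumes n is a multiple of 4 and an SSE-capable target):

    #include <xmmintrin.h>

    /* Multiply two float arrays element-wise and accumulate into dst, four lanes at a time. */
    void madd4 (float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);          /* load 4 floats from a   */
            __m128 vb = _mm_loadu_ps(b + i);          /* load 4 floats from b   */
            __m128 vd = _mm_loadu_ps(dst + i);        /* load 4 floats from dst */
            vd = _mm_add_ps(vd, _mm_mul_ps(va, vb));  /* dst += a * b, 4 lanes  */
            _mm_storeu_ps(dst + i, vd);               /* store the result back  */
        }
    }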

I can't give specific examples because it was too many years ago, but there were plenty of cases where hand-written assembler could outperform any compiler. Reasons why:

  • You could deviate from calling conventions, passing arguments in registers.

  • You could carefully consider how to use registers, and avoid storing variables in memory.

  • For things like jump tables, you could avoid having to bounds-check the index.

Basically, compilers do a pretty good job of optimizing, and that is nearly always "good enough", but in some situations (like graphics rendering) where you're paying dearly for every single cycle, you can take shortcuts because you know the code, where a compiler couldn't because it has to be on the safe side.

In fact, I have heard of some graphics rendering code where a routine like a line-draw or polygon-fill routine actually generated a small block of machine code on the stack and executed it there, to avoid continual decision-making about line style, width, pattern, etc.

That said, what I want a compiler to do is generate good assembly code for me but not be too clever, and they mostly do that. In fact, one of the things I hate about Fortran is its scrambling the code in an attempt to "optimize" it, usually to no significant purpose.

Usually, when apps have performance problems, it is due to wasteful design. These days, I would never recommend assembler for performance unless the overall app had already been tuned within an inch of its life, still was not fast enough, and was spending all its time in tight inner loops.

Added: I've seen plenty of apps written in assembly language, and the main speed advantage over a language like C, Pascal, Fortran, etc. was because the programmer was far more careful when coding in assembler. He or she is going to write roughly 100 lines of code a day, regardless of language, and in a compiler language that's going to equal 3 or 400 instructions.

I'm surprised no one said this. The strlen() function is much faster if written in assembly! In C, the best thing you can do is

 int c;
 for(c = 0; str[c] != '\0'; c++) {}

while in assembly you can speed it up considerably:

         mov esi, offset string
         mov edi, esi
         xor ecx, ecx
 lp:
         mov ax, word ptr [esi]      ; load two characters into al/ah (word, not byte)
         mov bx, word ptr [esi + 2]  ; load the next two into bl/bh
         cmp al, cl
         je  end_1
         cmp ah, cl
         je  end_2
         cmp bl, cl
         je  end_3
         cmp bh, cl
         je  end_4
         add esi, 4
         jmp lp
 end_4:  inc esi
 end_3:  inc esi
 end_2:  inc esi
 end_1:                              ; esi now points at the terminating zero
         mov ecx, esi
         sub ecx, edi

the length is in ecx. This compares 4 characters at a time, so it's 4 times faster. And think: using the high-order words of eax and ebx, it would become 8 times faster than the previous C routine!

You don't actually know whether your well-written C code is really fast if you haven't looked at the disassembly of what the compiler produces. Many times you look at it and see that "well-written" was subjective.

So it's not necessary to write in assembler to get the fastest code ever, but it's certainly worth knowing assembler for the very same reason.

A few examples from my experience:

  • Access to instructions that are not accessible from C. For instance, many architectures (like x86-64, IA-64, DEC Alpha, and 64-bit MIPS or PowerPC) support a 64 bit by 64 bit multiplication producing a 128 bit result. GCC recently added an extension providing access to such instructions, but before that assembly was required. And access to this instruction can make a huge difference on 64-bit CPUs when implementing something like RSA – sometimes as much as a factor of 4 improvement in performance. (A sketch of that GCC extension is given after this list.)

  • Access to CPU-specific flags. The one that has bitten me a lot is the carry flag; when doing a multiple-precision addition, if you don't have access to the CPU carry bit one must instead compare the result to see if it overflowed, which takes 3-5 more instructions per limb; and worse, which are quite serial in terms of data accesses, which kills performance on modern superscalar processors. When processing thousands of such integers in a row, being able to use addc is a huge win (there are superscalar issues with contention on the carry bit as well, but modern CPUs deal pretty well with it).

  • SIMD. Even autovectorizing compilers can only do relatively simple cases, so if you want good SIMD performance it's unfortunately often necessary to write the code directly. Of course you can use intrinsics instead of assembly but once you're at the intrinsics level you're basically writing assembly anyway, just using the compiler as a register allocator and (nominally) instruction scheduler. (I tend to use intrinsics for SIMD simply because the compiler can generate the function prologues and whatnot for me so I can use the same code on Linux, OS X, and Windows without having to deal with ABI issues like function calling conventions, but other than that the SSE intrinsics really aren't very nice – the Altivec ones seem better though I don't have much experience with them). As examples of things a (current day) vectorizing compiler can't figure out, read about bitslicing AES or SIMD error correction – one could imagine a compiler that could analyze algorithms and generate such code, but it feels to me like such a smart compiler is at least 30 years away from existing (at best).
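Regarding the first bullet above, a minimal sketch of the GCC extension it mentions (my own example, not from the answer): on 64-bit targets, the __int128 type lets the compiler emit the single 64x64-to-128 multiply instruction instead of a library call.

    #include <stdint.h>

    /* Upper 64 bits of a 64x64 -> 128-bit product. */
    uint64_t mulhi64 (uint64_t a, uint64_t b)
    {
        unsigned __int128 product = (unsigned __int128)a * b;  /* full 128-bit product */
        return (uint64_t)(product >> 64);                      /* keep the high half   */
    }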

On the other hand, multicore machines and distributed systems have shifted many of the biggest performance wins in the other direction – get an extra 20% speedup writing your inner loops in assembly, or 300% by running them across multiple cores, or 10000% by running them across a cluster of machines. And of course high level optimizations (things like futures, memoization, etc) are often much easier to do in a higher level language like ML or Scala than C or asm, and often can provide a much bigger performance win. So, as always, there are tradeoffs to be made.

More often than you think, C needs to do things that seem to be unnecessary from an assembly coder's point of view, just because the C standards say so.

Integer promotion, for example. If you want to shift a char variable in C, one would usually expect that the code would do in fact just that, a single bit shift.

The standards, however, force the compiler to do a sign extension to int before the shift and to truncate the result back to char afterwards, which might complicate the code depending on the target processor's architecture.
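A minimal sketch of the promotion being described (my own illustration, not from the answer):

    signed char shift_right (signed char c)
    {
        /* c is promoted to int, the int is shifted, and the conversion back to
           char truncates the result - on some targets that is more work than
           the single-bit shift the source code seems to ask for. */
        return (signed char)(c >> 1);
    }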

I think the general case when assembler is faster is when a smart assembly programmer looks at the compiler's output and says "this is a critical path for performance and I can write this to be more efficient" and then that person tweaks that assembler or rewrites it from scratch.

Tight loops, like when playing with images, since an image may consist of millions of pixels. Sitting down and figuring out how to make best use of the limited number of processor registers can make a difference. Here's a real-life sample:

http://danbystrom.se/2008/12/22/optimizing-away-ii/

Then often processors have some esoteric instructions which are too specialized for a compiler to bother with, but on occasion an assembler programmer can make good use of them. Take the XLAT instruction for example. Really great if you need to do table look-ups in a loop and the table is limited to 256 bytes!

Updated: Oh, just come to think of what's most crucial when we speak of loops in general: the compiler often has no clue how many iterations will be the common case! Only the programmer knows that a loop will be iterated MANY times and that it will therefore be beneficial to prepare for the loop with some extra work, or that it will be iterated so few times that the set-up will actually take longer than the iterations themselves.
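As a hedged aside (my own example, not from the answer), some compilers do let the programmer pass part of that knowledge down, e.g. GCC's __builtin_expect for weighting a branch:

    #define LIKELY(x) __builtin_expect(!!(x), 1)

    long sum (const int *data, long n)
    {
        long total = 0;
        for (long i = 0; LIKELY(i < n); ++i)   /* hint: the loop usually continues */
            total += data[i];
        return total;
    }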

It all depends on your workload.

For day-to-day operations, C and C++ are just fine, but there are certain workloads (any transforms involving video (compression, decompression, image effects, etc)) that pretty much require assembly to be performant.

They also usually involve using CPU specific chipset extensions (MME/MMX/SSE/whatever) that are tuned for those kinds of operation.

I have an operation of transposition of bits that needs to be done, on 192 or 256 bits every interrupt, that happens every 50 microseconds.

It happens according to a fixed map (hardware constraints). Using C, it took around 10 microseconds. When I translated this to assembler, taking into account the specific features of this map, specific register caching, and using bit-oriented operations, it took less than 3.5 microseconds to perform.

One of the possibilities of the CP/M-86 version of PolyPascal (sibling to Turbo Pascal) was to replace the "use-bios-to-output-characters-to-the-screen" facility with a machine language routine which in essence was given the x and y coordinates and the string to put there.

This allowed the screen to be updated much, much faster than before!

There was room in the binary to embed machine code (a few hundred bytes) and there was other stuff there too, so it was essential to squeeze as much as possible.

It turns out that since the screen was 80x25, both coordinates could fit in a byte each, so both could fit in a two-byte word. This made it possible to do the needed calculations in fewer bytes, since a single add could manipulate both values simultaneously.

To my knowledge there are no C compilers that can merge multiple values in a register, do SIMD instructions on them and split them out again later (and I don't think the machine instructions would be shorter anyway).
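Doing that merge by hand is still possible in C, of course. A hedged sketch of the idea described above (my own example; the field layout is an assumption): pack x and y into one 16-bit word so a single add updates both, as long as no carry crosses the byte boundary.

    #include <stdint.h>
    #include <stdio.h>

    int main (void)
    {
        uint16_t pos   = (uint16_t)((10 << 8) | 5);  /* y = 10 in the high byte, x = 5 in the low byte */
        uint16_t delta = (uint16_t)(( 1 << 8) | 2);  /* move down 1 row and right 2 columns            */

        pos += delta;                                /* one add updates both coordinates               */

        printf("x = %u, y = %u\n", pos & 0xFFu, pos >> 8);  /* prints x = 7, y = 11 */
        return 0;
    }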

The Linux assembly HOWTO asks this question and gives the pros and cons of using assembly.

One of the more famous snippets of assembly is from Michael Abrash's texture mapping loop (explained in detail here):

 add edx,[DeltaVFrac]           ; add in dVFrac
 sbb ebp,ebp                    ; store carry
 mov [edi],al                   ; write pixel n
 mov al,[esi]                   ; fetch pixel n+1
 add ecx,ebx                    ; add in dUFrac
 adc esi,[4*ebp + UVStepVCarry] ; add in steps

Nowadays most compilers express advanced CPU-specific instructions as intrinsics, i.e., functions that get compiled down to the actual instruction. MS Visual C++ supports intrinsics for MMX, SSE, SSE2, SSE3, and SSE4, so you have to worry less about dropping down to assembly to take advantage of platform-specific instructions. Visual C++ can also take advantage of the actual architecture you are targeting with the appropriate /ARCH setting.

http://cr.yp.to/qhasm.html has many examples.

gcc has become a widely used compiler. Its optimizations in general are not that good. Far better than the average programmer writing assembler, but for real performance, not that good. There are compilers that are simply incredible in the code they produce. So as a general answer there are going to be many places where you can go into the output of the compiler and tweak the assembler for performance, and/or simply re-write the routine from scratch.

The simple answer… One who knows assembly well (aka has the reference beside him, and is taking advantage of every little processor cache and pipeline feature etc) is guaranteed to be capable of producing much faster code than any compiler.

However the difference these days just doesn't matter in the typical application.

It might be worth looking at Optimizing Immutable and Purity by Walter Bright; it's not a profiled test, but it shows one good example of a difference between handwritten and compiler-generated ASM. Walter Bright writes optimising compilers, so it might be worth looking at his other blog posts.

Given the right programmer, Assembler programs can always be made faster than their C counterparts (at least marginally). It would be difficult to create a C program where you couldn't take out at least one instruction of the Assembler.

Longpoke, there is just one limitation: time. When you don't have the resources to optimize every single change to the code and spend your time allocating registers, optimizing a few spills away and whatnot, the compiler will win every single time. You do your modification to the code, recompile and measure. Repeat if necessary.

Also, you can do a lot on the high-level side. Also, inspecting the resulting assembly may give the IMPRESSION that the code is crap, but in practice it will run faster than what you think would be quicker. Example:

 int y = data[i];
 // do some stuff here..
 call_function(y, …);

The compiler will read the data, push it to stack (spill) and later read from stack and pass as argument. Sounds shite? It might actually be very effective latency compensation and result in faster runtime.

 // optimized version
 call_function(data[i], …);
 // not so optimized after all..

The idea with the optimized version was, that we have reduced register pressure and avoid spilling. But in truth, the "shitty" version was faster!

Looking at the assembly code, just counting the instructions and concluding that more instructions means slower, would be a misjudgment.

The thing to pay attention to here is that many assembly experts think they know a lot, but know very little. The rules change from one architecture to the next, too. There is no silver-bullet x86 code, for example, which is always the fastest. These days it's better to go by rules of thumb:

  • memory is slow
  • cache is fast
  • try to use the cache better
  • how often are you going to miss? do you have a latency compensation strategy?
  • you can execute 10-100 ALU/FPU/SSE instructions for one single cache miss
  • application architecture is important..
  • .. but it doesn't help when the problem isn't in the architecture

Also, trusting too much into compiler magically transforming poorly-thought-out C/C++ code into "theoretically optimum" code is wishful thinking. You have to know the compiler and tool chain you use if you care about "performance" at this low-level.

Compilers in C/C++ are generally not very good at re-ordering sub-expressions because the functions have side effects, for starters. Functional languages don't suffer from this caveat but don't fit the current ecosystem that well. There are compiler options to allow relaxed precision rules which allow order of operations to be changed by the compiler/linker/code generator.

This topic is a bit of a dead-end; for most it's not relevant, and the rest, they know what they are doing already anyway.

It all boils down to this: "to understand what you are doing", it's a bit different from knowing what you are doing.

How about creating machine code at run-time?

My brother once (around 2000) realised an extremely fast real-time ray-tracer by generating code at run-time. I can't remember the details, but there was some kind of main module which was looping through objects, then it was preparing and executing some machine code which was specific to each object.

However, over time, this method was superseded by new graphics hardware, and it became useless.

Today, I think that possibly some operations on big-data (millions of records) like pivot tables, drilling, calculations on-the-fly, etc. could be optimized with this method. The question is: is the effort worth it?
