Coding practices which enable the compiler/optimizer to make a faster program

Many years ago, C compilers were not particularly smart. As a workaround, K&R invented the register keyword to hint to the compiler that it might be a good idea to keep this variable in an internal register. They also made the ternary operator to help generate better code.

As time passed, compilers matured. They became very smart in that their flow analysis allowed them to make better decisions about what values to hold in registers than you could possibly do. The register keyword became unimportant.

FORTRAN can be faster than C for some sorts of operations, due to aliasing issues. In theory, with careful coding, one can get around this restriction to enable the optimizer to generate faster code.

What coding practices are available that may enable the compiler/optimizer to generate faster code?

  • Identifying the platform and compiler you use would be appreciated.
  • Why does the technique seem to work?
  • Sample code is encouraged.

Here is a related question.

[Edit] This question is not about the overall process of profiling and optimizing. Assume that the program has been written correctly, compiled with full optimization, tested and put into production. There may still be constructs in your code that prohibit the optimizer from doing the best job it can. What refactoring can you do that will remove these prohibitions and allow the optimizer to generate even faster code?

[Edit] Off-topic related link

Write to local variables and not output arguments! This can be a huge help in getting around aliasing slowdowns. For example, if your code looks like

    void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut) {
        for (int i = 0; i < numFoo; i++) {
            barOut.munge(foo1, foo2[i]);
        }
    }

the compiler doesn't know that foo1 != barOut, and so has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is done. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:

    void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut) {
        Foo barTemp = barOut;
        for (int i = 0; i < numFoo; i++) {
            barTemp.munge(foo1, foo2[i]);
        }
        barOut = barTemp;
    }

It sounds silly, but the compiler can be much smarter about a local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).

Here's a coding practice to help the compiler create fast code, in any language, on any platform, with any compiler, for any problem:

Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.

Next, profile your code.

Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make one change at a time and measure its impact.

Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a "trick" to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, have probably implemented it. They probably wrote what you read in the first place.

The order in which you traverse memory can have profound impacts on performance, and compilers aren't really good at figuring that out and fixing it. You have to be conscientious about cache locality concerns when you write code if you care about performance. For example, two-dimensional arrays in C are allocated in row-major format. Traversing arrays in column-major format will tend to give you more cache misses and make your program more memory-bound than processor-bound:

    #define N 1000000
    int matrix[N][N] = { ... };

    // awesomely fast
    long sum = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            sum += matrix[i][j];
        }
    }

    // painfully slow
    long sum = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            sum += matrix[j][i];
        }
    }

Generic Optimizations

Here are some of my favorite optimizations. I have actually improved execution times and reduced program sizes by using them.

Declare small functions as inline or macros

Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions may incur an overhead on return as well. An inefficient function or method is one with fewer statements in its body than the combined call and return overhead. These are good candidates for inlining, whether as #define macros or inline functions. (Yes, I know inline is only a suggestion, but in this case I consider it a reminder to the compiler.)
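A minimal sketch of both flavors (the names here are made up for illustration):

    // Macro: expanded textually at each use; no call overhead, but also no
    // type checking, and arguments may be evaluated more than once.
    #define SQUARE_MACRO(x) ((x) * (x))

    // Inline function: type-checked; `inline` is only a hint the compiler
    // is free to ignore.
    inline int square(int x) { return x * x; }

    int main() {
        int a = SQUARE_MACRO(5); // expands to ((5) * (5))
        int b = square(5);       // typically compiled down to the constant 25
        return a - b;            // 0
    }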

Remove dead and redundant code

If the code isn't used or does not contribute to the program's result, get rid of it.

Simplify the design of algorithms

I once removed a lot of assembly code and execution time from a program by writing out the algebraic equation it was calculating and then simplifying the algebraic expression. The implementation of the simplified algebraic expression took up less room and time than the original function.
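As a hedged illustration of the idea (not the original program): a running sum over 1..n collapses, once the algebra is simplified, into a closed form:

    // Loop version: n additions.
    unsigned long sum_loop(unsigned long n) {
        unsigned long sum = 0;
        for (unsigned long i = 1; i <= n; ++i)
            sum += i;
        return sum;
    }

    // Simplified algebra: 1 + 2 + ... + n == n * (n + 1) / 2.
    unsigned long sum_closed_form(unsigned long n) {
        return n * (n + 1) / 2;
    }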

Loop unrolling

Each loop has an overhead of incrementing and termination checking. To estimate the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower this number, the better.

Edit: an example of loop unrolling. Before:

    unsigned int sum = 0;
    for (size_t i = 0; i < BYTES_TO_CHECKSUM; ++i) {
        sum += *buffer++;
    }

After unrolling:

    unsigned int sum = 0;
    const size_t STATEMENTS_PER_LOOP = 8;
    size_t i = 0;
    for (; i + STATEMENTS_PER_LOOP <= BYTES_TO_CHECKSUM; i += STATEMENTS_PER_LOOP) {
        sum += *buffer++; // 1
        sum += *buffer++; // 2
        sum += *buffer++; // 3
        sum += *buffer++; // 4
        sum += *buffer++; // 5
        sum += *buffer++; // 6
        sum += *buffer++; // 7
        sum += *buffer++; // 8
    }
    // Handle the remainder:
    for (; i < BYTES_TO_CHECKSUM; ++i) {
        sum += *buffer++;
    }

A second benefit is obtained here: more statements are executed before the processor has to reload the instruction cache.

I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks, since the program had to calculate a checksum on a 2GB file. This optimization, combined with block reading, improved performance from 1 hour to 5 minutes. Loop unrolling provides excellent performance in assembly language too; my memcpy was a lot faster than the compiler's memcpy. – TM

Reduce if statements

Processors hate branches, or jumps, since they force the processor to reload its queue of instructions.

Boolean arithmetic (Edit: applied code formatting to the fragment, added an example)

Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:

    bool status = true;
    status = status && /* first test */;
    status = status && /* second test */;

The short-circuiting of the logical AND operator (&&) prevents the remaining tests from executing once status is false.

Example:

    struct Reader_Interface {
        virtual bool write(unsigned int value) = 0;
    };

    struct Rectangle {
        unsigned int origin_x;
        unsigned int origin_y;
        unsigned int height;
        unsigned int width;

        bool write(Reader_Interface * p_reader) {
            bool status = false;
            if (p_reader) {
                status = p_reader->write(origin_x);
                status = status && p_reader->write(origin_y);
                status = status && p_reader->write(height);
                status = status && p_reader->write(width);
            }
            return status;
        }
    };

Factor variable allocation outside of loops

If a variable is created on the fly inside a loop, move that creation/allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.
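A small sketch of the pattern (the names are made up); the point is that the string's buffer is constructed once and its capacity reused across iterations:

    #include <string>
    #include <vector>

    void process(const std::vector<std::string>& lines) {
        // Declaring `scratch` inside the loop would construct (and possibly
        // heap-allocate) a fresh string on every iteration.
        std::string scratch; // constructed once, capacity reused below
        for (const std::string& line : lines) {
            scratch = line;
            // ... work on scratch ...
        }
    }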

Factor constant expressions outside of loops

If a calculation or a variable's value does not depend on the loop index, move it outside of (before) the loop.
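For instance (a made-up example), calling strlen in the loop condition re-evaluates a value that never changes inside the loop; hoist it:

    #include <cstddef>
    #include <cstring>

    // Count occurrences of a character in a C string.
    std::size_t count_char(const char* s, char c) {
        std::size_t hits = 0;
        const std::size_t len = std::strlen(s); // loop-invariant: computed once
        for (std::size_t i = 0; i < len; ++i) { // instead of i < strlen(s) each pass
            if (s[i] == c)
                ++hits;
        }
        return hits;
    }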

I/O in blocks

Read and write data in large chunks (blocks). The bigger, the better. For example, reading one octet at a time is less efficient than reading 1024 octets with one read.
Example:

    static const char Menu_Text[] = "\n"
        "1) Print\n"
        "2) Insert new customer\n"
        "3) Destroy\n"
        "4) Launch Nasal Demons\n"
        "Enter selection: ";
    static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
    //...
    std::cout.write(Menu_Text, Menu_Text_Length);

The efficiency of this technique can be visually demonstrated. 🙂

Don't use the printf family for constant data

Constant data can be output using a block write. A formatted write wastes time scanning the text for formatting characters or processing formatting commands. See the code example above.

Format to memory, then write

Format to a char array using multiple sprintf calls, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail merge.
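A minimal sketch, assuming a record with two variable fields (the function and field names are hypothetical, and the fields are assumed short enough to fit the buffer):

    #include <cstdio>

    void write_record(std::FILE* f, const char* name, int score) {
        char buf[128];
        int len = 0;
        len += std::sprintf(buf + len, "name: %s\n", name);  // format to memory...
        len += std::sprintf(buf + len, "score: %d\n", score);
        std::fwrite(buf, 1, static_cast<size_t>(len), f);    // ...then one block write
    }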

Declare constant text (string literals) as static const

When variables are declared without static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.

Lastly, code the way the compiler would

Sometimes the compiler can optimize several small statements better than one complicated version. Also, writing code to help the compiler optimize helps too. If I want the compiler to use special block-transfer instructions, I will write code that looks like it should use the special instructions.

The optimizer isn't really in control of your program's performance; you are. Use appropriate algorithms and structures, and profile, profile, profile.

That said, you shouldn't call a small function that lives in another file from inside an inner loop, as that prevents it from being inlined.

Avoid taking the address of a variable if possible. Asking for a pointer isn't "free", since it means the variable needs to be kept in memory. Even an array can be kept in registers if you avoid pointers; this is essential for vectorizing.

Which leads to the next point: read the ^#$@ manual! GCC can vectorize plain C code if you sprinkle a __restrict__ here and an __attribute__((aligned)) there. If you want something very specific from the optimizer, you may have to be specific.

On most modern processors, the biggest bottleneck is memory.

Aliasing: Load-Hit-Store can be devastating in a tight loop. If you're reading one memory location and writing to another, and you know that they are disjoint, carefully putting an alias keyword on the function parameters can really help the compiler generate faster code. However, if the memory regions do overlap and you used "alias", you're in for a good debugging session of undefined behavior!

Cache miss: not really sure how to help the compiler here, since it's mostly algorithmic, but there are intrinsics to prefetch memory.
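One such intrinsic is GCC/Clang's __builtin_prefetch. A hedged sketch (the prefetch distance of 8 elements is a made-up tuning parameter that would need profiling on the target):

    #include <cstddef>

    long sum_array(const int* data, std::size_t n) {
        long sum = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + 8 < n)
                __builtin_prefetch(&data[i + 8]); // hint: this element is needed soon
            sum += data[i];
        }
        return sum;
    }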

Also, don't convert floating point values to int and vice versa too much, since they use different registers, and converting from one type to the other means calling the actual conversion instruction, writing the value to memory and reading it back in the proper register set.

The vast majority of code that people write is I/O bound (I believe all the code I have written for money in the last 30 years has been so bound), so the activities of the optimizer will be academic for most folks.

However, I would remind people that for your code to be optimized you have to tell the compiler to optimize it. Lots of people (including me, when I forget) post C++ benchmarks that are meaningless if the optimizer isn't enabled.

Use const correctness as much as possible in your code. It allows the compiler to optimize much better.
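A small sketch of what that looks like in practice (the class is invented): const on the parameter and on the member function tells both the reader and the optimizer that nothing here mutates its operands:

    #include <cstddef>
    #include <string>

    class Widget {
        std::string name_;
    public:
        explicit Widget(std::string name) : name_(std::move(name)) {}
        const std::string& name() const { return name_; } // const member function
    };

    std::size_t name_length(const Widget& w) { // w cannot be modified here
        return w.name().size();
    }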

There are loads of other optimization tips in this document: CPP optimizations (a somewhat old document, though)

Highlights:

  • Use constructor initialization lists
  • Use prefix operators (see the sketch after this list)
  • Use explicit constructors
  • Inline functions
  • Avoid temporary objects
  • Be aware of the cost of virtual functions
  • Return objects via reference parameters
  • Consider per-class allocation
  • Consider STL container allocators
  • The "empty member" optimization
  • etc.
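Of these, the prefix-operator tip is the easiest to show in a few lines: for iterator types, it++ must construct and return a copy of the old value, while ++it does not. A minimal sketch:

    #include <vector>

    long total(const std::vector<int>& v) {
        long sum = 0;
        for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
            sum += *it; // ++it avoids an unused temporary iterator each pass
        return sum;
    }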

Try to program using static single assignment as much as possible. SSA is exactly the same as what you end up with in most functional programming languages, and that is what most compilers convert your code to in order to do their optimizations, because it is easier to work with. Doing this brings to light the places where the compiler might get confused. It also makes all but the worst register allocators work as well as the best register allocators, and it allows you to debug more easily because you almost never have to wonder where a variable got its value from, since there is only one place it was assigned.
Avoid global variables.
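A hedged sketch of what SSA-style source looks like (the names are invented): each value gets its own name and is assigned exactly once, instead of one variable being reused for several meanings:

    double shipping_cost(double weight, double rate, double surcharge) {
        // Instead of: double x = weight; x = x * rate; x = x + surcharge;
        const double base  = weight * rate;     // assigned exactly once
        const double total = base + surcharge;  // assigned exactly once
        return total;
    }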

When working with data through a reference or a pointer, pull it into a local variable, do your work, and then copy it back. (Unless you have a good reason not to.)

Use nearly-free comparisons with 0. Most processors give you a comparison with 0 for free with any math or logical operation: you almost always get flags for == 0 and < 0, from which you can easily get three conditions:

    x = f();
    if (!x) {
        a();
    } else if (x < 0) {
        b();
    } else {
        c();
    }

This is almost always cheaper than testing against other constants.

Another trick is to use subtraction to eliminate one comparison in a range test.

    #define FOO_MIN 8
    #define FOO_MAX 199

    int good_foo(int foo) {
        unsigned int bar = foo - FOO_MIN;
        int rc = ((FOO_MAX - FOO_MIN) < bar) ? 1 : 0;
        return rc;
    }

This can often avoid a jump in languages that short-circuit boolean expressions, and it saves the compiler from having to figure out how to keep the result of the first comparison around while combining it with the second one. This may look like it has the potential to use up an extra register, but it almost never does. Often you don't need foo anymore anyway, and if you're not using rc yet, its value can live there.

When using the string functions in C (strcpy, memcpy, …), remember what they return: the destination! You can often get better code by "forgetting" your copy of the pointer to the destination and just grabbing it back from the return of these functions.
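A small sketch (buf is assumed big enough for the concatenation): strcpy and strcat both return their destination argument, so the result can be fed straight into the next call instead of keeping a second live copy of the pointer around:

    #include <cstdio>
    #include <cstring>

    void greet(char* buf, const char* name) {
        std::puts(std::strcat(std::strcpy(buf, "hello, "), name));
    }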

Don't overlook the opportunity to return exactly the same thing your last function call returned. Compilers are not so great at:

    foo_t * make_foo(int a, int b, int c) {
        foo_t * x = malloc(sizeof(foo_t));
        if (!x) {
            // return NULL;
            return x; // x is NULL, already in the register used for returns, so duh
        }
        x->a = a;
        x->b = b;
        x->c = c;
        return x;
    }

Of course, you can reverse the logic on that if and have only a single return point.

(Tricks I recalled later)

Declaring functions static is always a good idea when you can. If the compiler can prove to itself that it has accounted for every caller of a particular function, then it can break the calling convention for that function in the name of optimization. The compiler can often avoid moving parameters into the registers or stack positions where called functions usually expect their parameters (it has to deviate in both the called function and in all callers to do this). The compiler can also often take advantage of knowing what memory and registers the called function will need, and avoid generating code to preserve variable values that live in registers or memory locations the called function doesn't disturb. This works particularly well when there are few calls to a function. This gets much of the benefit of inlining code without actually inlining.

I wrote an optimizing C compiler, and here are some very useful things to consider:

  1. Make most of the functions static. This allows interprocedural constant propagation and alias analysis to do their job; otherwise the compiler needs to presume that the function can be called from outside the translation unit with completely unknown values for the parameters. If you look at well-known open-source libraries, they all mark functions static except the ones that really need to be extern.

  2. If global variables are used, mark them static and constant if possible. If they are initialized once (read-only), it's better to use an initializer list like static const int VAL[] = {1,2,3,4}; otherwise the compiler might not discover that the variables are actually initialized constants, and it will fail to replace loads of the variables with the constants.

  3. NEVER use a goto into the inside of a loop; most compilers will no longer recognize the loop, and none of the most important optimizations will be applied.

  4. Use pointer parameters only if necessary, and mark them restrict if possible. This helps alias analysis a lot, because the programmer guarantees there is no alias (interprocedural alias analysis is usually very primitive). Very small struct objects should be passed by value, not by reference. (See the sketch after this list.)

  5. Use arrays instead of pointers whenever possible, especially inside loops (a[i]). An array usually offers more information for alias analysis, and after some optimizations the same code will be generated anyway (search for loop strength reduction if curious). This also increases the chance that loop-invariant code motion is applied.

  6. Try to hoist out of the loop calls to large functions or external functions that don't have side effects (and don't depend on the current loop iteration). Small functions are in many cases inlined or converted to intrinsics that are easy to hoist, but large functions might seem to the compiler to have side effects when in fact they don't. Side effects for external functions are completely unknown, with the exception of some functions from the standard library which some compilers model, making loop-invariant code motion possible.

  7. When writing a test with multiple conditions, place the most likely one first. if (a || b || c) should be if (b || a || c) if b is more likely to be true than the others. Compilers usually don't know anything about the possible values of the conditions and which branches are taken more often (they could know it by using profile information, but few programmers use it).

  8. Using a switch is faster than doing a test like (a || b || … || z). Check first whether your compiler does this automatically (some do), and it might be more readable to have the if anyway.
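A hedged sketch of tip 4, using the __restrict__ spelling (a GCC/Clang extension in C++; plain restrict in C99). The promise that the two buffers never overlap lets the compiler vectorize instead of reloading src after every store to dst:

    #include <cstddef>

    void scale(float* __restrict__ dst, const float* __restrict__ src,
               float factor, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = src[i] * factor; // no aliasing: safe to vectorize
    }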

A silly little tip, but one that will save you microscopic amounts of speed and code.

Always pass function arguments in the same order.

If you have f_1(x, y, z) which calls f_2, declare f_2 as f_2(x, y, z). Do not declare it as f_2(x, z, y).

The reason for this is that the C/C++ platform ABI (a.k.a. the calling convention) promises to pass arguments in particular registers and stack locations. When the arguments are already in the correct registers, it doesn't have to move them around.

While reading disassembled code, I've seen some ridiculous register shuffling because people didn't follow this rule.

Working on embedded systems and writing code in C/C++, I try to avoid dynamic memory allocation as much as possible. The main reason I do this isn't necessarily performance, but this rule of thumb does have performance implications.

Algorithms used to manage the heap are notoriously slow on some platforms (e.g. vxworks). Even worse, the time it takes to return from a call to malloc is highly dependent on the current state of the heap. Therefore, any function that calls malloc takes a performance hit that can't easily be accounted for. That hit may be minimal while the heap is still clean, but after the device runs for a while the heap can become fragmented: the calls take longer, and you can't easily calculate how performance will degrade over time. You can't really produce a worst-case estimate. The optimizer can't provide you with any help in this case either. To make matters even worse, if the heap becomes too heavily fragmented, the calls will start failing altogether. The solution is to use memory pools (e.g. glib slices) instead of the heap. The pool calls are faster and deterministic if you do it right.
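A hedged sketch of the pool idea (the block size and count are made-up template parameters): every allocate and release is a couple of pointer moves, regardless of how long the system has been running:

    #include <cstddef>

    template <std::size_t BlockSize, std::size_t BlockCount>
    class Pool {
        union Block { Block* next; unsigned char bytes[BlockSize]; };
        Block storage_[BlockCount];
        Block* free_ = nullptr;
    public:
        Pool() {
            for (std::size_t i = 0; i < BlockCount; ++i) { // thread a free list
                storage_[i].next = free_;
                free_ = &storage_[i];
            }
        }
        void* allocate() {            // O(1), never fragments
            if (!free_) return nullptr;
            Block* b = free_;
            free_ = b->next;
            return b;
        }
        void release(void* p) {       // O(1)
            Block* b = static_cast<Block*>(p);
            b->next = free_;
            free_ = b;
        }
    };

    // Usage: Pool<32, 1024> pool; void* p = pool.allocate(); pool.release(p);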

Two coding techniques I don't see in the above list:

Bypass the linker by writing the code as a unique source

While separate compilation is really nice for compile time, it is very bad when you speak of optimization. Basically, the compiler can't optimize across compilation units; the linker's domain is off-limits to it.

But if you design your program well, you can also compile it through a unique common source. Instead of compiling unit1.c and unit2.c and then linking the two objects, compile all.c, which merely #includes unit1.c and unit2.c. Thus you will benefit from all the compiler optimizations.
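A minimal sketch of such an all.c (the unit file names are the ones used above):

    // all.c -- the only file handed to the compiler; it now sees both units
    // at once and can inline and propagate constants across them.
    #include "unit1.c"
    #include "unit2.c"

    // Build with, e.g.:  cc -O2 all.c -o app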

It's very much like writing header-only programs in C++ (and even easier to do in C).

This technique is easy enough if you write your program to enable it from the beginning, but you must also be aware that it changes part of the C semantics and that you can meet some problems like static variables or macro collisions. For most programs it's easy enough to overcome the small problems that occur. Also be aware that compiling as a unique source is way slower and may take a huge amount of memory (usually not a problem with modern systems).

Using this simple technique I happened to make some programs I wrote ten times faster!

Like the register keyword, this trick could also become obsolete soon. Optimizing through the linker is beginning to be supported by compilers; see gcc: Link time optimization.

Separate atomic tasks in loops

This one is more tricky. It's about the interaction between algorithm design and the way the optimizer manages cache and register allocation. Quite often programs have to loop over some data structure and perform some actions for each item. Quite often the actions performed can be split into two logically independent tasks. If that is the case, you can write exactly the same program with two loops on the same boundary, each performing exactly one task. In some cases writing it this way can be faster than the unique loop (the details are more complex, but an explanation can be that in the simple-task case all variables can be kept in processor registers, while in the more complex one that isn't possible: some registers must be written to memory and read back later, and the cost is higher than the additional flow control).

Be careful with this one (profile performance with and without this trick), because as with using register it may just as well give worse performance as improved.
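A hedged sketch of the idea on a made-up example (and, as said above, profile both versions; both assume n >= 1):

    #include <cstddef>

    // Fused: one loop performing two logically independent tasks.
    void stats_fused(const float* in, std::size_t n, float& sum, float& maxv) {
        sum = 0.0f;
        maxv = in[0];
        for (std::size_t i = 0; i < n; ++i) {
            sum += in[i];
            if (in[i] > maxv) maxv = in[i];
        }
    }

    // Split: same bounds, one task per loop; each loop keeps fewer values
    // live at once, which can sometimes beat the fused version.
    void stats_split(const float* in, std::size_t n, float& sum, float& maxv) {
        sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) sum += in[i];
        maxv = in[0];
        for (std::size_t i = 0; i < n; ++i) if (in[i] > maxv) maxv = in[i];
    }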

I've actually seen this done in SQLite and they claim it results in performance boosts of ~5%: put all your code in one file or use the preprocessor to do the equivalent of this. This way the optimizer will have access to the entire program and can do more interprocedural optimizations.

Most modern compilers should do a good job speeding up tail recursion , because the function calls can be optimized out.

Example:

    int fac2(int x, int cur) {
        if (x == 1) return cur;
        return fac2(x - 1, cur * x);
    }

    int fac(int x) {
        return fac2(x, 1);
    }

Of course this example doesn't have any bounds checking.

Late Edit

While I have no direct knowledge of the code; it seems clear that the requirements of using CTEs on SQL Server were specifically designed so that it can optimize via tail-end recursion.

Don't do the same work over and over again!

A common antipattern that I see goes along these lines:

    void Function() {
        MySingleton::GetInstance()->GetAggregatedObject()->DoSomething();
        MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingElse();
        MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingCool();
        MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingReallyNeat();
        MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingYetAgain();
    }

The compiler actually has to call all of those functions all of the time. Assuming you, the programmer, know that the aggregated object isn't changing over the course of these calls, for the love of all that is holy…

    void Function() {
        MySingleton* s = MySingleton::GetInstance();
        AggregatedObject* ao = s->GetAggregatedObject();
        ao->DoSomething();
        ao->DoSomethingElse();
        ao->DoSomethingCool();
        ao->DoSomethingReallyNeat();
        ao->DoSomethingYetAgain();
    }

In the case of the singleton getter the calls may not be too costly, but it is certainly a cost (typically, "check to see if the object has been created; if it hasn't, create it, then return it"). The more complicated this chain of getters becomes, the more wasted time we'll have.

  1. Use the most local scope possible for all variable declarations.

  2. Use const whenever possible

  3. Don't use register unless you plan to profile both with and without it

The first two of these, especially #1, help the optimizer analyze the code. In particular, they will help it make good choices about what variables to keep in registers.

Blindly using the register keyword is as likely to hurt as to help your optimization. It's just too hard to know what will matter until you look at the assembly output or profile.

There are other things that matter to getting good performance out of code; designing your data structures to maximize cache coherency for instance. But the question was about the optimizer.

Align your data to native/natural boundaries.

I was reminded of something that I encountered once, where the symptom was simply that we were running out of memory, but the result was substantially increased performance (as well as huge reductions in memory footprint).

The problem in this case was that the software we were using made tons of little allocations. Like, allocating four bytes here, six bytes there, etc. A lot of little objects, too, running in the 8-12 byte range. The problem wasn't so much that the program needed lots of little things, it's that it allocated lots of little things individually, which bloated each allocation out to (on this particular platform) 32 bytes.

Part of the solution was to put together an Alexandrescu-style small object pool, but extend it so I could allocate arrays of small objects as well as individual items. This helped immensely in performance as well since more items fit in the cache at any one time.

The other part of the solution was to replace the rampant use of manually-managed char* members with an SSO (small-string optimization) string. The minimum allocation being 32 bytes, I built a string class that had an embedded 28-character buffer behind a char*, so 95% of our strings didn't need to do an additional allocation (and then I manually replaced almost every appearance of char* in this library with this new class, that was fun or not). This helped a ton with memory fragmentation as well, which then increased the locality of reference for other pointed-to objects, and similarly there were performance gains.

A neat technique I learned from @MSalters comment on this answer allows compilers to do copy elision even when returning different objects according to some condition:

    // before
    BigObject a, b;
    if (condition)
        return a;
    else
        return b;

    // after
    BigObject a, b;
    if (condition)
        swap(a, b);
    return a;

If you've got small functions you call repeatedly, I have in the past gotten large gains by putting them in headers as "static inline". Function calls on the ix86 are surprisingly expensive.

Reimplementing recursive functions in a non-recursive way using an explicit stack can also gain a lot, but then you really are in the realm of development time vs gain.
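A hedged sketch of the rewrite for a made-up tree type: the recursion becomes a loop over an explicit std::vector stack, so there is no call overhead and no risk of blowing the call stack on deep trees:

    #include <vector>

    struct Node { int value; Node* left; Node* right; };

    // Recursive form, for reference:
    //   int sum(Node* n) { return n ? n->value + sum(n->left) + sum(n->right) : 0; }
    int sum(Node* root) {
        int total = 0;
        std::vector<Node*> stack;
        if (root) stack.push_back(root);
        while (!stack.empty()) {
            Node* n = stack.back();
            stack.pop_back();
            total += n->value;
            if (n->left)  stack.push_back(n->left);
            if (n->right) stack.push_back(n->right);
        }
        return total;
    }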

Here's my second piece of optimisation advice. As with my first piece of advice this is general purpose, not language or processor specific.

Read the compiler manual thoroughly and understand what it is telling you. Use the compiler to its utmost.

I agree with one or two of the other respondents who have identified selecting the right algorithm as critical to squeezing performance out of a program. Beyond that the rate of return (measured in code execution improvement) on the time you invest in using the compiler is far higher than the rate of return in tweaking the code.

Yes, compiler writers are not from a race of coding giants and compilers contain mistakes and what should, according to the manual and according to compiler theory, make things faster sometimes makes things slower. That's why you have to take one step at a time and measure before- and after-tweak performance.

And yes, ultimately, you might be faced with a combinatorial explosion of compiler flags so you need to have a script or two to run make with various compiler flags, queue the jobs on the large cluster and gather the run time statistics. If it's just you and Visual Studio on a PC you will run out of interest long before you have tried enough combinations of enough compiler flags.

Regards,

Mark

When I first pick up a piece of code I can usually get a factor of 1.4 — 2.0 times more performance (ie the new version of the code runs in 1/1.4 or 1/2 of the time of the old version) within a day or two by fiddling with compiler flags. Granted, that may be a comment on the lack of compiler savvy among the scientists who originate much of the code I work on, rather than a symptom of my excellence. Having set the compiler flags to max (and it's rarely just -O3) it can take months of hard work to get another factor of 1.05 or 1.1

When DEC came out with its alpha processors, there was a recommendation to keep the number of arguments to a function under 7, as the compiler would always try to put up to 6 arguments in registers automatically.

For performance, focus first on writing maintainable code: componentized, loosely coupled, etc., so that when you have to isolate a part either to rewrite, optimize or simply profile, you can do it without much effort.

The optimizer will only help your program's performance marginally.

You're getting good answers here, but they assume your program is pretty close to optimal to begin with, and you say

Assume that the program has been written correctly, compiled with full optimization, tested and put into production.

In my experience, a program may be written correctly, but that does not mean it is near optimal. It takes extra work to get to that point.

If I can give an example, this answer shows how a perfectly reasonable-looking program was made over 40 times faster by macro-optimization . Big speedups can't be done in every program as first written, but in many (except for very small programs), it can, in my experience.

After that is done, micro-optimization (of the hot-spots) can give you a good payoff.

I use the Intel compiler, on both Windows and Linux.

When more or less done, I profile the code. Then I hang on the hotspots and try to change the code to allow the compiler to do a better job.

If the code is computational and contains a lot of loops, the vectorization report in the Intel compiler is very helpful: look for 'vec-report' in the help.

So the main idea: polish the performance-critical code. As for the rest, the priority is to be correct and maintainable: short functions, clear code that can still be understood a year later.

One optimization I have used in C++ is creating a constructor that does nothing. One must manually call an init() in order to put the object into a working state.

This has benefit in the case where I need a large vector of these classes.

I call reserve() to allocate the space for the vector, but the constructor does not actually touch the page of memory the object is on. So I have spent some address space, but not actually consumed a lot of physical memory. I avoid the page faults and the associated construction costs.

As I generate objects to fill the vector, I set them using init(). This limits my total page faults, and avoids the need to resize() the vector while filling it.
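A hedged sketch of the pattern (Widget and init() are invented stand-ins for the real classes):

    #include <cstddef>
    #include <vector>

    struct Widget {
        int id;                          // deliberately left uninitialized
        Widget() {}                      // does nothing: touches no memory
        void init(int new_id) { id = new_id; }
    };

    std::vector<Widget> make_widgets(std::size_t n) {
        std::vector<Widget> v;
        v.reserve(n);                    // one allocation up front, no resizes
        for (std::size_t i = 0; i < n; ++i) {
            v.emplace_back();            // trivial construction
            v.back().init(static_cast<int>(i)); // pages touched only as filled
        }
        return v;
    }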

One thing I've done is try to keep expensive actions to places where the user might expect the program to delay a bit. Overall performance is related to responsiveness, but isn't quite the same, and for many things responsiveness is the more important part of performance.

The last time I really had to do improvements in overall performance, I kept an eye out for suboptimal algorithms, and looked for places that were likely to have cache problems. I profiled and measured performance first, and again after each change. Then the company collapsed, but it was interesting and instructive work anyway.

I have long suspected, but never proved, that declaring arrays so that they hold a power of 2 as the number of elements enables the optimizer to do a strength reduction by replacing a multiply with a shift by a number of bits when looking up individual elements.

Put small and/or frequently called functions at the top of the source file. That makes it easier for the compiler to find opportunities for inlining.