性能优化策略不得已而为之

这个网站上已经有很多性能问题了，但是几乎所有的问题都是针对特定问题的，而且相当狭窄。几乎所有的重复build议，以避免过早优化。

我们假设：

代码已经正常工作
所select的algorithm对于问题的情况已经是最佳的
代码已被测量，并且违规程序已被隔离
所有优化的尝试也将被测量，以确保它们不会让事情变得更糟

我在这里寻找的是一些策略和技巧，当没有其他任何事情可以做，但是无论如何，都可以挤压到关键algorithm中的最后几个百分比。

理想情况下，尽量使语言无关的答案，并在适用的情况下指出build议策略的任何不利方面。

我会用我自己的最初build议来添加一个回复，并期待Stack Overflow社区能够想到的任何其他内容。

好的，你正在把问题定义为看起来没有太多改进空间的地方。根据我的经验，这是相当罕见的。我试图在1993年11月的多布斯博士的文章中解释这个问题，从一个没有明显浪费的传统devise的非平凡程序开始，并通过一系列的优化，直到它的挂钟时间从48秒到1.1秒，源代码大小减less了4倍。我的诊断工具是这样的。变化的顺序是这样的：

发现的第一个问题是占用半数以上的列表集群（现在称为“迭代器”和“容器类”）。那些被相当简单的代码取代，把时间缩短到20秒。
现在最大的接手者是更多的名单build设。作为一个比例，之前并没有那么大，但现在是因为更大的问题被解决了。我find了加速的方法，时间下降到17秒。
现在很难find明显的罪魁祸首，但也有一些小的我可以做点什么，时间下降到13秒。

现在我好像撞墙了。样品告诉我它到底在做什么，但我似乎无法find任何可以改进的地方。然后，我反思了程序的基本devise，以及它的事务驱动结构，并询问它所做的所有列表search是否实际上是由问题的要求来规定的。

然后我重新进行了一次重新devise，其中程序代码实际上是由一个较小的源代码生成的（通过预处理器macros），而程序不是经常搞清楚程序员知道的东西是否可预测。换句话说，不要“解释”要做的事情的顺序，“编译”它。

重新devise完成后，将源代码缩小4倍，时间减less到10秒。

现在，因为速度太快，很难抽样，所以我给它做了10倍的工作，但是下面的时间是基于原来的工作量。

更多的诊断表明，它正在花时间在队列pipe理。内嵌这些减less了7秒的时间。
现在一个大的接受者是我一直在做的诊断印刷。冲洗 – 4秒。
现在最大的接受者是malloc和free的调用。回收对象 – 2.6秒。
继续抽样，我仍然发现并非绝对必要的操作–1.1秒。

总加速因子：43.6

现在没有两个程序是相似的，但在非玩具软件中，我总是看到这样的进展。首先你得到了简单的东西，然后就越困难，直到你达到收益递减点。那么你获得的洞察力很可能导致重新devise，开始新一轮的加速，直到你再次达到收益递减。现在，这个问题可能会让人怀疑， ++i或i++或for(;;)或while(1)是否更快：我常常在SO上看到的问题types。

PS可能会想知道为什么我不使用分析器。答案是这些“问题”几乎每一个都是一个函数调用的网站，这些堆栈样本是精确定位的。即使在今天，Profiler也只是几乎没有想到声明和调用指令比定位更重要，而且比整个函数更容易修复。实际上，我build立了一个分析器来完成这个工作，但是对于一个真正的代码在干什么的亲密关系来说，没有任何东西可以替代它。样本数量不是很less，因为没有发现的问题太小而容易被忽略。

补充：jerryjvl请求了一些例子。这是第一个问题。它由less量单独的代码行组成，占用了一半的时间：

  /* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */ if (ptop->current_task >= ILST_LENGTH(ptop->tasklist){ . . . /* FOR EACH OPERATION REQUEST */ for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){ . . . /* GET CURRENT TASK */ ptask = ILST_NTH(ptop->tasklist, ptop->current_task)

这些使用列表集群ILST（类似于列表类）。它们以通常的方式实施，“信息隐藏”意味着class级的用户不应该在意如何实施。当写这些行（大约800行代码）时，没有想到这些可能是一个“瓶颈”（我讨厌这个词）。他们只是推荐的做事方式。事后很容易说这些应该是可以避免的，但是根据我的经验， 所有的性能问题都是这样的。一般来说，尽量避免造成性能问题是一件好事。即使它们“应该被避免”（事后看来），find并修复所创build的更好。我希望能给出一点点味道。

这是第二个问题，分两行：

  /* ADD TASK TO TASK LIST */ ILST_APPEND(ptop->tasklist, ptask) . . . /* ADD TRANSACTION TO TRANSACTION QUEUE */ ILST_APPEND(trnque, ptrn)

这些是通过将项目附加到它们的末端来build立列表。（修正是收集数组中的项目，并一次build立所有的列表）。有趣的是，这些语句只花费了原始时间的3/48（即在调用堆栈上），所以它们不在事实上在一开始就是个大问题。然而，在消除第一个问题后，他们花费了3/20的时间，所以现在是“更大的鱼”。一般来说，就是这样。

我可能会补充说，这个项目是从一个真正的项目中提炼出来的。在那个项目中，性能问题更加剧烈（加速也是如此），例如在内部循环中调用数据库访问例程来查看任务是否完成。

参考文献：源代码，原始和重新devise，可以在www.ddj.comfind，1993年，文件9311.zip，文件slug.asc和slug.zip。

编辑2011/11/26：现在有一个包含Visual C ++源代码的sourceforge项目，以及它是如何被调优的。它只经过上述情景的前半部分，并不完全遵循相同的顺序，但仍然得到2-3个数量级的加速。

build议：

预先计算而不是重新计算 ：包含具有相对有限input范围的计算的任何循环或重复调用，考虑对包含有效范围内的所有值的计算结果进行查找（数组或字典）投入。然后在algorithm中使用简单的查找。
方面：如果实际使用的预计算值很less，这可能会使情况变得更糟，查找也可能占用大量的内存。
不要使用库方法 ：大多数库需要编写在各种场景下正确运行，并对参数执行空检查等。通过重新实现一个方法，您可能会去掉大量的逻辑并不适用于您正在使用它的确切情况。
下方：编写额外的代码意味着更多的表面区域的错误。
用图书馆的方法 ：与我自己相反，语言图书馆是由比我们更聪明的人写的; 他们做得更好，更快。不要自己实现它，除非你能更快地实现它（即：总是测量！）
作弊：在某些情况下，尽pipe您的问题可能存在确切的计算方法，但您可能不需要“确切”，有时候可能“足够好”，并且交易速度会更快。问问自己，答案是否超出1％真的很重要吗？ 5％？甚至10％？
下方：呃…答案并不准确。

如果不能再提高性能，请参阅是否可以提高感知性能。

你可能无法使你的fooCalcalgorithm更快，但是通常有办法让你的应用程序看起来更能响应用户。

几个例子：

预测用户将要求什么，然后开始工作
显示他们进来的结果，而不是一下子结束
准确的进度表

这些不会让你的程序更快，但它可能会让你的用户更快乐。

我在这个地方度过了大半生。广泛的笔触是运行你的分析器，并logging下来：

caching未命中 。数据caching是大多数程序中排名第一的来源。通过重组违规数据结构来提高caching命中率，使其具有更好的局部性; 打包结构和数字types，以消除浪费的字节（并因此浪费高速caching提取）; 尽可能预取数据以减less失速。
负载打击商店 。关于指针混叠的编译器假设，以及数据在通过内存断开连接的寄存器组之间移动的情况，可能会导致一定的病态行为，导致整个CPUpipe道在加载操作中清除。 find漂浮物，vector和投入物相互投射的地方，并消除它们。宽松地使用__restrict来承诺编译器关于别名。
微码操作 。大多数处理器有一些不能被stream水线化的操作，而是运行存储在ROM中的一个小子程序。 PowerPC上的例子是整数乘法，除法和逐variables。问题是整个pipe道在执行这个操作时停止。尽量消除这些操作的使用，或者至less把它们分解成组成的stream水线操作，这样你就可以在程序的其他部分正在做的事情上得到超标量调度的好处。
分行错误预测 。这些也是空的pipe道。查找CPU在分支之后花费大量时间重新填充pipe道的情况，并使用分支提示（如果可用）使其更频繁地正确预测。或者更好的是，尽可能用条件移动replace分支， 特别是在浮点操作之后，因为它们的pipe道通常更深，读取fcmp之后的条件标志会导致失速。
顺序浮点运算 。做这些SIMD。

还有一件事我喜欢做：

设置您的编译器输出汇编列表，并查看它发出的代码中的热点function。所有这些聪明的优化，“一个好的编译器应该能够自动为你做”？机会是你的实际编译器不这样做。我见过GCC发出真正的WTF代码。

抛出更多的硬件！

更多build议：

避免I / O ：任何I / O（磁盘，networking，端口等）总是比任何正在执行计算的代码慢得多，所以摆脱任何你并不严格需要的I / O。
预先移动I / O ：将预先需要的所有数据加载到计算中，以便在重要algorithm的核心内不会有重复的I / O等待（也可能是因为重复磁盘寻道，当加载所有的数据在一个命中可能会避免寻求）。
延迟I / O ：在计算结束之前不要写出结果，将其存储在数据结构中，然后在完成工作时最后一次性转储结果。
Threaded I / O ：对于那些敢于冒险的人来说，将“I / O预先”或“延迟I / O”与实际计算结合起来，通过将加载移动到并行线程中，以便在加载更多数据时，在对已有数据进行计算时，或在计算下一批数据时，可以同时写出最后一批的结果。

由于许多性能问题涉及数据库问题，所以在调整查询和存储过程时，我会给你一些具体的东西。

避免在大多数数据库中使用游标。避免循环。大多数时候，数据访问应该是基于设置的，而不是由logging处理logging的。这包括当您要一次插入1,000,000个logging时不重复使用单个logging存储过程。

切勿使用select *，只返回实际需要的字段。如果有任何连接，连接字段将被重复，并因此在服务器和networking上造成不必要的负载，则这尤其如此。

避免使用相关的子查询。使用连接（如果可能，包括派生表的连接）（我知道这对于Microsoft SQL Server是正确的，但在使用不同的后端时testingbuild议）。

指数，指数，指数。如果适用于您的数据库，请更新这些统计信息。

使查询sargable 。意思是避免使用索引无法使用的东西，例如在连接的类似子句或函数的第一个字符中使用通配符，或者作为where语句的左边部分。

使用正确的数据types。在date字段上进行datemath比在必须尝试将string数据types转换为date数据types后执行计算要快。

切勿将任何types的循环放入触发器！

大多数数据库都有办法检查查询执行的方式。在Microsoft SQL Server中，这被称为执行计划。首先检查那些问题区域在哪里。

考虑查询运行的频率以及运行需要多长时间才能确定哪些需要优化。有时你可以通过轻微的调整来获得更多的性能，每天运行数百万次的查询，而不是从每个月只运行一次的long_running查询中清除时间。

使用某种分析器工具来找出真正被发送到数据库的内容。我可以记得有一次，我们无法弄清楚当存储过程很快时为什么页面加载速度太慢，通过分析发现网页多次询问查询而不是一次。

分析器还将帮助您查找谁阻止了谁。由于来自其他查询的锁，一些单独运行的快速执行的查询可能会变得非常慢。

今天唯一最重要的限制因素是有限的记忆bandwitdh 。由于带宽是在核心之间共享的，多核只会使情况更糟。另外，专用于实现高速caching的有限的芯片面积也在内核和线程之间分配，更加恶化了这个问题。最后，保持不同高速caching一致性所需的片间信令也随着内核数量的增加而增加。这也增加了一个惩罚。

这些是你需要pipe理的效果。有时通过微观pipe理你的代码，但有时通过仔细的考虑和重构。

很多评论已经提到了caching友好的代码。至less有两种不同的风味：

避免内存取指延迟。
降低内存总线压力（带宽）。

第一个问题主要是使数据访问模式更加规律，使硬件预取器能够高效工作。避免dynamic内存分配将数据对象分散在内存中。使用线性容器而不是链接列表，哈希和树。

第二个问题与改进数据重用有关。修改algorithm以处理适合可用caching的数据子集，并在caching中尽可能多地重用该数据。

打包数据更加紧凑，并确保在热循环中使用高速caching行中的所有数据，将有助于避免这些其他影响，并允许在caching中安装更多有用的数据。

你在运行什么硬件？你可以使用平台特定的优化（如vector化）吗？
你能得到一个更好的编译器？例如从GCC切换到英特尔？
你可以让你的algorithm并行运行吗？
你可以通过重组数据来减lesscaching未命中吗？
你可以禁用断言？
针对您的编译器和平台进行微型优化。在“在一个if / else中”的风格，首先要把最常见的陈述

内联例程（消除调用/返回和参数推送）
尝试用表查找消除testing/开关（如果它们更快）
展开循环（Duff的设备）到刚好适合CPUcaching的位置
本地化内存访问，以免打击caching
本地化相关的计算，如果优化器还没有这样做
如果优化器尚未这样做，则消除循环不variables

您应该考虑“Google透视图”，也就是确定您的应用程序如何在很大程度上并行化和并发化，这也意味着在某些时候需要考虑将应用程序分布在不同的机器和networking上，以便能够理想地按比例线性扩展与你扔在硬件。

另一方面，谷歌也因为投入了大量的人力和资源来解决他们正在使用的项目，工具和基础设施中的一些问题，比如像海湾合作委员会的整个项目优化，有一个专门的工程师团队攻击gcc内部，以便为Google典型的用例场景做好准备。

类似地，对应用程序进行分析不再意味着简单地对程序代码进行概要分析，而是将其所有周边的系统和基础架构（思考networking，交换机，服务器，RAIDarrays）从系统angular度确定为冗余和优化潜力。

虽然我喜欢麦克·邓拉维的回答，但实际上这个答案的确是一个很好的答案，我认为可以这样简单地expression：

找出最先花费的时间，并了解原因。

这是时间的识别过程，可以帮助您了解您必须完善algorithm的位置。这是我能find的一个全面涵盖的语言不可知的答案，这个答案已经被认为是完全优化的。也假设你想在build立独立的速度追求。

所以虽然algorithm可能被优化，但它的实现可能不是。标识可以让你知道哪个部分是哪个：algorithm或实现。因此，无论哪一个时间，大部分时间都是您的主要候选人。但是既然你说你想挤掉最后几个百分比，那么你可能还需要首先仔细检查一下你没有仔细检查过的部分。

最后，通过不同的方法来实现相同的解决scheme或可能不同的algorithm的性能数据的试验和错误，可以带来洞察力，帮助识别浪费时间和节省时间。

HPH，asoudmove。

当你使用高效algorithm的时候，你需要更多的速度或者内存 。使用caching“在内存中”支付更多的速度或使用计算来减less内存占用。
如果可能的话（而且更具成本效益）会抛出硬件的问题 – 更快的CPU，更多的内存或HD可以更快地解决问题，然后尝试编码。
如果可能， 使用并行 – 在多个线程上运行部分代码。
使用正确的工具来完成这项工作 。一些编程语言创build更高效的代码，使用托pipe代码（即Java / .NET）加速开发，但本地编程语言创build更快的运行代码。
微型优化 。只有适用你可以使用优化的程序集来加速小块代码，在正确的地方使用SSE /vector优化可以大大提高性能。

分而治之

如果正在处理的数据集太大，则循环使用它的大块。如果你已经完成了你的代码，实现应该很简单。如果你有一个单一的程序，现在你知道更好。

首先，正如前面几个答案中所提到的那样，了解什么会影响你的性能 – 是内存还是处理器，还是networking，数据库还是别的什么东西。取决于…

…如果它是记忆 – find由“计算机程序devise艺术”系列之一Knuth很久以前写的书籍之一。最有可能的是分类和search – 如果我的记忆是错误的，那么你将不得不找出他在谈论如何处理缓慢的磁带数据存储。将他的内存/磁带对分别转换成一对caching/主存（或一对L1 / L2caching）。研究他所描述的所有技巧 – 如果你找不到解决问题的东西，那就聘请专业的计算机科学家进行专业的研究。如果你的记忆问题偶然出现在FFT中（当做基数2的蝴蝶时，在比特反转索引中caching未命中），那么不要聘请科学家 – 而是手动优化一个一个的传递，直到你赢或者得到到死胡同。你提到了最后几个百分点吧？如果确实很less，你很有可能获胜。
…如果是处理器 – 切换到汇编语言。研究处理器规范 – 什么需要蜱 ，VLIW，SIMD。函数调用很可能是可replace的滴答机。学习循环转换 – pipe道，展开。乘法和除法可能是可replace/内插的位移（乘以小整数可能是可replace的加法）。用较短的数据尝试技巧 – 如果幸运的话，一个64位的指令可能会变成32位的两个或者16位的4个或者8位的8位。也尝试更长的数据 – 例如你的浮点运算可能会比在特定处理器上的双精度运算慢。如果你有三angular函数的东西，用预先计算的表格来对抗; 还要记住，如果精度损失在允许范围内，小值的正弦可能会被该值取代。
…如果是networking – 考虑压缩你通过它的数据。用二进制replaceXML传输。研究协议。尝试UDP而不是TCP，如果你能以某种方式处理数据丢失。
…如果是数据库，那么，去任何数据库论坛，并寻求build议。内存数据网格，优化查询计划等

HTH 🙂

我认为这已经以不同的方式说了。但是当你处理一个处理器密集型的algorithm时，你应该简化所有内部循环中的所有内容。

这对一些人来说可能是显而易见的，但无论我使用哪种语言，我都会关注这一点。例如，如果您正在处理嵌套循环，并且您有机会将某些代码降级，那么在某些情况下可以大大加快您的代码速度。另外一个例子，有些事情需要考虑，比如尽可能地使用整数而不是浮点variables，并且尽可能使用乘法而不是分割。再次，这些是你最内在的循环应该考虑的事情。

有时候，你可能会发现在内部循环中的一个整数上执行math运算的好处，然后将其缩小到一个浮点variables，之后你可以使用它。这是一个牺牲速度的例子，以提高另一个速度，但在某些情况下，回报可能是非常值得的。

caching！ （在程序员的努力下）使得几乎所有东西都变得更快的一个便宜的方法是在程序的任何数据移动区域添加一个caching抽象层。无论是I / O还是传递/创build对象或结构。通常很容易将caching添加到工厂类和读者/作者。

有时候caching不会让你获得太多，但是只需要添加caching，然后在没有帮助的地方禁用caching，这是一个简单的方法。我经常发现这样可以获得巨大的性能，而不必微观分析代码。

Very difficult to give a generic answer to this question. It really depends on your problem domain and technical implementation. A general technique that is fairly language neutral: Identify code hotspots that cannot be eliminated, and hand-optimize assembler code.

Last few % is a very CPU and application dependent thing….

cache architectures differ, some chips have on-chip RAM you can map directly, ARM's (sometimes) have a vector unit, SH4's a useful matrix opcode. Is there a GPU – maybe a shader is the way to go. TMS320 's are very sensitive to branches within loops (so separate loops and move conditions outside if possible).

The list goes on…. But these sorts of things really are the last resort…

Build for x86, and run Valgrind /Cachegrind against the code for proper performance profiling. Or Texas Instruments' CCStudio has a sweet profiler. Then you'll really know where to focus…

Did you know that a CAT6 cable is capable of 10x better shielding off extrenal inteferences than a default Cat5e UTP cable?

For any non-offline projects, while having best software and best hardware, if your throughoutput is weak, then that thin line is going to squeeze data and give you delays, albeit in milliseconds… but if you are talking about the last drops, that's a some drops gained, 24/7 for any packge sent or received.

I've spent some time working on optimising client/server business systems operating over low-bandwidth and long-latency networks (eg satellite, remote, offshore), and been able to achieve some dramatic performance improvements with a fairly repeatable process.

Measure : Start by understanding the network's underlying capacity and topology. Talking to the relevant networking people in the business, and make use of basic tools such as ping and traceroute to establish (at a minimum) the network latency from each client location, during typical operational periods. Next, take accurate time measurements of specific end user functions that display the problematic symptoms. Record all of these measurements, along with their locations, dates and times. Consider building end-user "network performance testing" functionality into your client application, allowing your power users to participate in the process of improvement; empowering them like this can have a huge psychological impact when you're dealing with users frustrated by a poorly performing system.
Analyze : Using any and all logging methods available to establish exactly what data is being transmitted and received during the execution of the affected operations. Ideally, your application can capture data transmitted and received by both the client and the server. If these include timestamps as well, even better. If sufficient logging isn't available (eg closed system, or inability to deploy modifications into a production environment), use a network sniffer and make sure you really understand what's going on at the network level.
Cache : Look for cases where static or infrequently changed data is being transmitted repetitively and consider an appropriate caching strategy. Typical examples include "pick list" values or other "reference entities", which can be surprisingly large in some business applications. In many cases, users can accept that they must restart or refresh the application to update infrequently updated data, especially if it can shave significant time from the display of commonly used user interface elements. Make sure you understand the real behaviour of the caching elements already deployed – many common caching methods (eg HTTP ETag) still require a network round-trip to ensure consistency, and where network latency is expensive, you may be able to avoid it altogether with a different caching approach.
Parallelise : Look for sequential transactions that don't logically need to be issued strictly sequentially, and rework the system to issue them in parallel. I dealt with one case where an end-to-end request had an inherent network delay of ~2s, which was not a problem for a single transaction, but when 6 sequential 2s round trips were required before the user regained control of the client application, it became a huge source of frustration. Discovering that these transactions were in fact independent allowed them to be executed in parallel, reducing the end-user delay to very close to the cost of a single round trip.
Combine : Where sequential requests must be executed sequentially, look for opportunities to combine them into a single more comprehensive request. Typical examples include creation of new entities, followed by requests to relate those entities to other existing entities.
Compress : Look for opportunities to leverage compression of the payload, either by replacing a textual form with a binary one, or using actual compression technology. Many modern (ie within a decade) technology stacks support this almost transparently, so make sure it's configured. I have often been surprised by the significant impact of compression where it seemed clear that the problem was fundamentally latency rather than bandwidth, discovering after the fact that it allowed the transaction to fit within a single packet or otherwise avoid packet loss and therefore have an outsize impact on performance.
Repeat : Go back to the beginning and re-measure your operations (at the same locations and times) with the improvements in place, record and report your results. As with all optimisation, some problems may have been solved exposing others that now dominate.

In the steps above, I focus on the application related optimisation process, but of course you must ensure the underlying network itself is configured in the most efficient manner to support your application too. Engage the networking specialists in the business and determine if they're able to apply capacity improvements, QoS, network compression, or other techniques to address the problem. Usually, they will not understand your application's needs, so it's important that you're equipped (after the Analyse step) to discuss it with them, and also to make the business case for any costs you're going to be asking them to incur. I've encountered cases where erroneous network configuration caused the applications data to be transmitted over a slow satellite link rather than an overland link, simply because it was using a TCP port that was not "well known" by the networking specialists; obviously rectifying a problem like this can have a dramatic impact on performance, with no software code or configuration changes necessary at all.

Not nearly as in depth or complex as previous answers, but here goes: (these are more beginner/intermediate level)

obvious: dry
run loops backwards so you're always comparing to 0 rather than a variable
use bitwise operators whenever you can
break repetitive code into modules/functions
cache objects
local variables have slight performance advantage
limit string manipulation as much as possible

Impossible to say. It depends on what the code looks like. If we can assume that the code already exists, then we can simply look at it and figure out from that, how to optimize it.

Better cache locality, loop unrolling, Try to eliminate long dependency chains, to get better instruction-level parallelism. Prefer conditional moves over branches when possible. Exploit SIMD instructions when possible.

Understand what your code is doing, and understand the hardware it's running on. Then it becomes fairly simple to determine what you need to do to improve performance of your code. That's really the only truly general piece of advice I can think of.

Well, that, and "Show the code on SO and ask for optimization advice for that specific piece of code".

If better hardware is an option then definitely go for that. 除此以外

Check you are using the best compiler and linker options.
If hotspot routine in different library to frequent caller, consider moving or cloning it to the callers module. Eliminates some of the call overhead and may improve cache hits (cf how AIX links strcpy() statically into separately linked shared objects). This could of course decrease cache hits also, which is why one measure.
See if there is any possibility of using a specialized version of the hotspot routine. Downside is more than one version to maintain.
Look at the assembler. If you think it could be better, consider why the compiler did not figure this out, and how you could help the compiler.
Consider: are you really using the best algorithm? Is it the best algorithm for your input size?

The google way is one option "Cache it.. Whenever possible don't touch the disk"

Here are some quick and dirty optimization techniques I use. I consider this to be a 'first pass' optimization.

Learn where the time is spent Find out exactly what is taking the time. Is it file IO? Is it CPU time? Is it the network? Is it the Database? It's useless to optimize for IO if that's not the bottleneck.

Know Your Environment Knowing where to optimize typically depends on the development environment. In VB6, for example, passing by reference is slower than passing by value, but in C and C++, by reference is vastly faster. In C, it is reasonable to try something and do something different if a return code indicates a failure, while in Dot Net, catching exceptions are much slower than checking for a valid condition before attempting.

Indexes Build indexes on frequently queried database fields. You can almost always trade space for speed.

Avoid lookups Inside of the loop to be optimized, I avoid having to do any lookups. Find the offset and/or index outside of the loop and reuse the data inside.

Minimize IO try to design in a manner that reduces the number of times you have to read or write especially over a networked connection

Reduce Abstractions The more layers of abstraction the code has to work through, the slower it is. Inside the critical loop, reduce abstractions (eg reveal lower-level methods that avoid extra code)

Spawn Threads for projects with a user interface, spawning a new thread to preform slower tasks makes the application feel more responsive, although isn't.

Pre-process You can generally trade space for speed. If there are calculations or other intense operations, see if you can precompute some of the information before you're in the critical loop.

Sometimes changing the layout of your data can help. In C, you might switch from an array or structures to a structure of arrays, or vice versa.

Tweak the OS and framework.

It may sound an overkill but think about it like this: Operating Systems and Frameworks are designed to do many things. Your application only does very specific things. If you could get the OS do to exactly what your application needs and have your application understand how the the framework (php,.net,java) works, you could get much better out of your hardware.

Facebook, for example, changed some kernel level thingys in Linux, changed how memcached works (for example they wrote a memcached proxy, and used udp instead of tcp ).

Another example for this is Window2008. Win2K8 has a version were you can install just the basic OS needed to run X applicaions (eg Web-Apps, Server Apps). This reduces much of the overhead that the OS have on running processes and gives you better performance.

Of course, you should always throw in more hardware as the first step…

pass by reference instead of by value

Reduce variable sizes (in embedded systems)

If your variable size is larger than the word size on a specific architecture, it can have a significant effect on both code size and speed. For example, if you have a 16 bit system, and use a long int variable very often, and later realize that it can never get outside the range (−32.768 … 32.767) consider reducing it to short int.

From my personal experience, if a program is ready or almost ready, but we realize it takes up about 110% or 120% of the target hardware's program memory, a quick normalization of variables usually solves the problem more often than not.

By this time, optimizing the algorithms or parts of the code itself can become frustratingly futile:

reorganize the whole structure and the program no longer works as intended, or at least you introduce a lot of bugs.
do some clever tricks: usually you spend a lot of time optimizing something, and discover no or very small decrease in code size, as the compiler would have optimized it anyway.

Many people make the mistake of having variables which exactly store the numerical value of a unit they use the variable for: for example, their variable time stores the exact number of milliseconds, even if only time steps of say 50 ms are relevant. Maybe if your variable represented 50 ms for each increment of one, you would be able to fit into a variable smaller or equal to the word size. On an 8 bit system, for example, even a simple addition of two 32-bit variables generates a fair amount of code, especially if you are low on registers, while 8 bit additions are both small and fast.