Why are elementwise additions much faster in separate loops than in a combined loop?

```cpp
const int n = 100000;

for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}
```

```cpp
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}

for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}
```

PS: I am not sure if this helps:

```asm
movsd  xmm0,mmword ptr [edx+18h]
addsd  xmm0,mmword ptr [ecx+20h]
movsd  mmword ptr [ecx+20h],xmm0
movsd  xmm0,mmword ptr [esi+10h]
addsd  xmm0,mmword ptr [eax+30h]
movsd  mmword ptr [eax+30h],xmm0
movsd  xmm0,mmword ptr [edx+20h]
addsd  xmm0,mmword ptr [ecx+28h]
movsd  mmword ptr [ecx+28h],xmm0
movsd  xmm0,mmword ptr [esi+18h]
addsd  xmm0,mmword ptr [eax+38h]
```

```asm
addsd  xmm0,mmword ptr [eax+28h]
movsd  mmword ptr [eax+28h],xmm0
movsd  xmm0,mmword ptr [ecx+20h]
addsd  xmm0,mmword ptr [eax+30h]
movsd  mmword ptr [eax+30h],xmm0
movsd  xmm0,mmword ptr [ecx+28h]
addsd  xmm0,mmword ptr [eax+38h]
movsd  mmword ptr [eax+38h],xmm0
movsd  xmm0,mmword ptr [ecx+30h]
addsd  xmm0,mmword ptr [eax+40h]
movsd  mmword ptr [eax+40h],xmm0
```

PPS: Here is the full code. It uses TBB `tick_count` for higher-resolution timing, which can be disabled by not defining the `TBB_TIMING` macro:

```cpp
#include <iostream>
#include <iomanip>
#include <cmath>
#include <string>
#include <cstdio>
#include <algorithm>

//#define TBB_TIMING

#ifdef TBB_TIMING
#include <tbb/tick_count.h>
using tbb::tick_count;
#else
#include <time.h>
#endif

using namespace std;

//#define preallocate_memory new_cont

enum { new_cont, new_sep };

double *a1, *b1, *c1, *d1;

void allo(int cont, int n)
{
    switch (cont) {
      case new_cont:
        a1 = new double[n*4];
        b1 = a1 + n;
        c1 = b1 + n;
        d1 = c1 + n;
        break;
      case new_sep:
        a1 = new double[n];
        b1 = new double[n];
        c1 = new double[n];
        d1 = new double[n];
        break;
    }

    for (int i = 0; i < n; i++) {
        a1[i] = 1.0;
        d1[i] = 1.0;
        c1[i] = 1.0;
        b1[i] = 1.0;
    }
}

void ff(int cont)
{
    switch (cont) {
      case new_sep:
        delete[] b1;
        delete[] c1;
        delete[] d1;
      case new_cont:
        delete[] a1;
    }
}

double plain(int n, int m, int cont, int loops)
{
#ifndef preallocate_memory
    allo(cont, n);
#endif

#ifdef TBB_TIMING
    tick_count t0 = tick_count::now();
#else
    clock_t start = clock();
#endif

    if (loops == 1) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
                c1[j] += d1[j];
            }
        }
    } else {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
            }
            for (int j = 0; j < n; j++) {
                c1[j] += d1[j];
            }
        }
    }

    double ret;

#ifdef TBB_TIMING
    tick_count t1 = tick_count::now();
    ret = 2.0 * double(n) * double(m) / (t1 - t0).seconds();
#else
    clock_t end = clock();
    ret = 2.0 * double(n) * double(m) / (double)(end - start) * double(CLOCKS_PER_SEC);
#endif

#ifndef preallocate_memory
    ff(cont);
#endif

    return ret;
}

int main()
{
    freopen("C:\\test.csv", "w", stdout);

    const char *s = " ";

    string na[2] = { "new_cont", "new_sep" };

    cout << "n";

    for (int j = 0; j < 2; j++)
        for (int i = 1; i <= 2; i++)
#ifdef preallocate_memory
            cout << s << i << "_loops_" << na[preallocate_memory];
#else
            cout << s << i << "_loops_" << na[j];
#endif

    cout << endl;

    long long nmax = 1000000;

#ifdef preallocate_memory
    allo(preallocate_memory, nmax);
#endif

    for (long long n = 1L; n < nmax; n = max(n + 1, (long long)(n * 1.2))) {
        const long long m = 10000000 / n;
        cout << n;
        for (int j = 0; j < 2; j++)
            for (int i = 1; i <= 2; i++)
                cout << s << plain(n, m, j, i);
        cout << endl;
    }
}
```

(It shows FLOP/s for different values of `n`.)

```cpp
#include <iostream>
#include <cstring>
#include <cstdlib>
#include <ctime>
using namespace std;

int main() {
    const int n = 100000;

#ifdef ALLOCATE_SEPERATE
    double *a1 = (double*)malloc(n * sizeof(double));
    double *b1 = (double*)malloc(n * sizeof(double));
    double *c1 = (double*)malloc(n * sizeof(double));
    double *d1 = (double*)malloc(n * sizeof(double));
#else
    double *a1 = (double*)malloc(n * sizeof(double) * 4);
    double *b1 = a1 + n;
    double *c1 = b1 + n;
    double *d1 = c1 + n;
#endif

    // Zero the data to prevent any chance of denormals.
    memset(a1, 0, n * sizeof(double));
    memset(b1, 0, n * sizeof(double));
    memset(c1, 0, n * sizeof(double));
    memset(d1, 0, n * sizeof(double));

    // Print the addresses
    cout << a1 << endl;
    cout << b1 << endl;
    cout << c1 << endl;
    cout << d1 << endl;

    clock_t start = clock();

    int c = 0;
    while (c++ < 10000) {
#if ONE_LOOP
        for (int j = 0; j < n; j++) {
            a1[j] += b1[j];
            c1[j] += d1[j];
        }
#else
        for (int j = 0; j < n; j++) {
            a1[j] += b1[j];
        }
        for (int j = 0; j < n; j++) {
            c1[j] += d1[j];
        }
#endif
    }

    clock_t end = clock();
    cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    system("pause");
    return 0;
}
```

EDIT: Results on an actual Core 2 architecture machine:

2 x Intel Xeon X5482 Harpertown @ 3.2 GHz:

```
#define ALLOCATE_SEPERATE
#define ONE_LOOP
00600020
006D0020
007A0020
00870020
seconds = 6.206

#define ALLOCATE_SEPERATE
//#define ONE_LOOP
005E0020
006B0020
00780020
00850020
seconds = 2.116

//#define ALLOCATE_SEPERATE
#define ONE_LOOP
00570020
00633520
006F6A20
007B9F20
seconds = 1.894

//#define ALLOCATE_SEPERATE
//#define ONE_LOOP
008C0020
00983520
00A46A20
00B09F20
seconds = 1.993
```

• One loop: 6.206 seconds; two loops: 2.116 seconds. This reproduces the OP's results.

• In the first two tests, the arrays are allocated separately. You will notice that they all have the same alignment relative to the page.

• In the second two tests, the arrays are packed together to break that alignment. Here you will notice that both loops are faster. Furthermore, the second (double) loop is now the slower one, as you would normally expect.

The 5 Regions – Explained

@Mysticial's answer convinced many people (including me), probably because it was the only one that relied on facts, but it was only one "data point" of the truth.

1. It has the largest difference between the one- and two-loop versions (almost a factor of three).

2. It is the only case where the one loop (i.e., with contiguous allocation) beats the two-loop version. (This is what made Mysticial's answer possible.)

The Proposal

```cpp
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}
```

```cpp
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}
```

Assume `n = 2` and we are working with bytes. In my scenario we thus have just 4 bytes of cache, and the rest of memory is significantly slower (say, 100 times longer access time).

For the separate loops:

```cpp
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}
```
• Cache `a1[0]` and `a1[1]`, then `b1[0]` and `b1[1]`, and set `a1[0] = a1[0] + b1[0]` in cache – there are now four bytes in cache: `a1[0], a1[1]` and `b1[0], b1[1]`. Cost = 100 + 100.

• Set `a1[1] = a1[1] + b1[1]` in cache. Cost = 1 + 1.
• Repeat for `c1` and `d1`.
• Total cost = `(100 + 100 + 1 + 1) * 2 = 404`

For the combined loop:

```cpp
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}
```
• Cache `a1[0]` and `a1[1]`, then `b1[0]` and `b1[1]`, and set `a1[0] = a1[0] + b1[0]` in cache – there are now four bytes in cache: `a1[0], a1[1]` and `b1[0], b1[1]`. Cost = 100 + 100.

• Cache `c1[0]` and `c1[1]`, then `d1[0]` and `d1[1]`, and set `c1[0] = c1[0] + d1[0]` in cache – this evicts `a1[0], a1[1]` and `b1[0], b1[1]` from the cache. Cost = 100 + 100.
• I suspect you are beginning to see where I am going with this.
• Total cost = `(100 + 100 + 100 + 100) * 2 = 800`

PS: I changed the loops to count down to zero, and with that the combined method was marginally faster. Scratching my head. Note the new array sizes and loop counts.

```cpp
// MemBufferMystery.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include <cmath>
#include <string>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define dbl double
#define MAX_ARRAY_SZ 262145    //16777216    // AKA (2^24)
#define STEP_SZ 1024    //   65536    // AKA (2^16)

int _tmain(int argc, _TCHAR* argv[]) {
    long i, j, ArraySz = 0, LoopKnt = 1024;
    time_t start, Cumulative_Combined = 0, Cumulative_Separate = 0;
    dbl *a = NULL, *b = NULL, *c = NULL, *d = NULL, *InitToOnes = NULL;

    a = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    b = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    c = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    d = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    InitToOnes = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));

    // Initialize the reference array to 1.0.
    for (j = 0; j < MAX_ARRAY_SZ; j++) {
        InitToOnes[j] = 1.0;
    }

    // Increase the size of the arrays and time both loop structures.
    for (ArraySz = STEP_SZ; ArraySz < MAX_ARRAY_SZ; ArraySz += STEP_SZ) {
        a = (dbl *)realloc(a, ArraySz * sizeof(dbl));
        b = (dbl *)realloc(b, ArraySz * sizeof(dbl));
        c = (dbl *)realloc(c, ArraySz * sizeof(dbl));
        d = (dbl *)realloc(d, ArraySz * sizeof(dbl));

        // Outside the timing loop, initialize
        // b and d arrays to 1.0 for consistent += performance.
        memcpy((void *)b, (void *)InitToOnes, ArraySz * sizeof(dbl));
        memcpy((void *)d, (void *)InitToOnes, ArraySz * sizeof(dbl));

        start = clock();
        for (i = LoopKnt; i; i--) {
            // Count down to zero; indexing from ArraySz itself would
            // be one element out of bounds.
            for (j = ArraySz - 1; j >= 0; j--) {
                a[j] += b[j];
                c[j] += d[j];
            }
        }
        Cumulative_Combined += (clock() - start);
        printf("\n %6i milliseconds for combined array sizes %i and %i loops",
               (int)(clock() - start), ArraySz, LoopKnt);

        start = clock();
        for (i = LoopKnt; i; i--) {
            for (j = ArraySz - 1; j >= 0; j--) {
                a[j] += b[j];
            }
            for (j = ArraySz - 1; j >= 0; j--) {
                c[j] += d[j];
            }
        }
        Cumulative_Separate += (clock() - start);
        printf("\n %6i milliseconds for separate array sizes %i and %i loops \n",
               (int)(clock() - start), ArraySz, LoopKnt);
    }

    printf("\n Cumulative combined array processing took %10.3f seconds",
           (dbl)(Cumulative_Combined / (dbl)CLOCKS_PER_SEC));
    printf("\n Cumulative separate array processing took %10.3f seconds",
           (dbl)(Cumulative_Separate / (dbl)CLOCKS_PER_SEC));
    getchar();

    free(a); free(b); free(c); free(d); free(InitToOnes);
    return 0;
}
```

The OP's code:

```cpp
const int n = 100000;

for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}
```

```cpp
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}
```

• We will let our loop and its iterations run from 1 to 100000 instead of starting at 0, since we don't need to worry about the 0-indexing scheme of memory addressing; we are only interested in the algorithm itself.
• In both cases we have 4 functions to work with and 2 function calls, with 2 operations being done per function call. We will set these up as functions and function calls: `F1()`, `F2()`, `f(a)`, `f(b)`, `f(c)` and `f(d)`.

```
Sum n=1 : [1,100000] = F1(), F2();
                       F1() = { f(a) = f(a) + f(b); }
                       F2() = { f(c) = f(c) + f(d); }
```

```
Sum1 n=1 : [1,100000] = F1();
                        F1() = { f(a) = f(a) + f(b); }

Sum2 n=1 : [1,100000] = F1();
                        F1() = { f(c) = f(c) + f(d); }
```

In the first case, each iteration of `Sum` calls `f(a)`, which adds `f(b)` to itself, and then calls `f(c)`, which does the same but adds `f(d)` to itself, for each of the `100000` iterations. In the second case we have `Sum1` and `Sum2`, and both act as if they were the same function being called twice in a row. In that case we can treat `Sum1` and `Sum2` as just plain old `Sum`, where `Sum` now looks like this: `Sum n=1 : [1,100000] { f(a) = f(a) + f(b); }`. This looks like an optimization where we can simply consider it to be the same function.

Summary with Analogy

From what we have seen in the second case, it almost appears as if there is an optimization, since both for loops have the exact same signature, but that isn't the real issue. The issue isn't the work being done by `f(a)`, `f(b)`, `f(c)` and `f(d)` in either case; it is the difference in the distance that the summation has to travel in the two cases that produces the difference in execution time.

Think of the `For Loops` as being the `Summations` that do the iterations, and think of the iterating as a `Boss` giving orders to two people, `A` and `B`, whose jobs are to meet `C` and `D` respectively, pick up a package from them, and return it. In this analogy the for-loop or summation iterations and condition checks themselves don't actually represent the `Boss`. What actually represents the `Boss` here is not the mathematical algorithms directly, but the concept of `Scope` and `Code Block` within a routine or sub-routine, method, function, translation unit, etc. The first algorithm has 1 scope where the 2nd algorithm has 2 consecutive scopes.

Within the first case, on each call slip the `Boss` goes to `A` and gives the order, and `A` goes off to fetch `B's` package; then the `Boss` goes to `C` and gives the order to do the same and receive the package from `D`, on each iteration.

Within the second case the `Boss` works directly with `A` to go and fetch `B's` package until all packages are received. Then the `Boss` works with `C` to do the same for getting all of `D's` packages.

Since we are working with an 8-byte pointer and dealing with heap allocation, let's consider the following problem. Say the `Boss` is 100 feet from `A`, and `A` is 500 feet from `C`. We don't need to worry about how far the `Boss` is initially from `C` because of the order of execution: in both cases the `Boss` travels from `A` first, then to `B`. This analogy isn't to say the distances are exact; it is just a test-case scenario to show the workings of the algorithms. In many cases, when doing heap allocations and working with the cache and page files, the distances between address locations may not vary much, or they can vary significantly, depending on the nature of the data types and the array sizes.

The Test Cases:

First Case: On the first iteration the `Boss` has to go 100 feet to give the order slip to `A`, and `A` goes off and does his thing, but then the `Boss` has to travel 500 feet to `C` to give him his order slip. Then on the next iteration, and every iteration after that, the `Boss` has to go back and forth 500 feet between the two.

Second Case: The `Boss` has to travel 100 feet on the first iteration to `A`, but after that he is already there and just waits for `A` to get back until all slips are filled. Then the `Boss` has to travel 500 feet on the first iteration to `C`, because `C` is 500 feet from `A` and this `Boss( Summation, For Loop )` is called right after working with `A`; then he just waits, as he did with `A`, until all of `C's` order slips are done.

The Difference In Distances Traveled

```
const n = 100000
distTraveledOfFirst = (100 + 500) + ((n-1)*(500 + 500));
// Simplify
distTraveledOfFirst = 600 + (99999*1000);
distTraveledOfFirst = 600 + 99999000;
distTraveledOfFirst = 99999600;
// Distance Traveled On First Algorithm = 99,999,600 ft

distTraveledOfSecond = 100 + 500 = 600;
// Distance Traveled On Second Algorithm = 600 ft
```

The Comparison of Arbitrary Values

We can easily see that 600 feet is several orders of magnitude less than the first algorithm's total. Now this isn't exact, because we don't know the actual difference in distance between addresses in RAM, or which cache or page file each access on each iteration will hit, due to many other unseen variables; it is just an assessment of the situation, looked at from the worst-case scenario.

So by these numbers it would almost look as if Algorithm One should be 99% slower than Algorithm Two; however, this is only the `Boss's` part or responsibility of the algorithms, and it doesn't account for the actual workers `A`, `B`, `C`, and `D` and what they have to do on each and every iteration of the loop. So the boss's job only accounts for about 15 – 40% of the total work being done, and the bulk of the work, which is done by the workers, has a somewhat bigger impact, keeping the ratio of the speed differences to about 50 – 70%.

The Observation: The Differences Between the Two Algorithms

In this situation it is the structure of the process of the work being done. Case 2 is more efficient both from the partial optimization of having a similar function declaration and definition, where only the variables differ by name, and because the total distance traveled in Case 1 is much farther than in Case 2. We can consider that distance traveled our time factor between the two algorithms: Case 1 has considerably more work to do than Case 2. This was also seen in the evidence of the `ASM` shown for both cases. Even beyond what has already been said, this doesn't account for the fact that in Case 1 the `Boss` has to wait for both `A` and `C` to get back before he can go to `A` again on the next iteration, nor for the fact that if `A` or `B` takes an extremely long time, then both the `Boss` and the other worker(s) are waiting idle. In Case 2 the only one being idle is the `Boss`, until the worker gets back. So even this has an impact on the algorithm.

Case 1 is a classic interpolation problem that happens to be an inefficient one. I also think this was one of the leading reasons why many machine architects and developers ended up building and designing multi-core systems with the ability to run multi-threaded applications, as well as parallel programming.

So even looking at it from this approach, without involving how the hardware, OS, and compiler work together to do heap allocations involving RAM, cache, page files, etc., the mathematics behind it already shows us which of the two is the better solution, using the above analogy where the `Boss`, or the `Summations` (the `For Loops`), had to travel between workers `A` and `B`. We can easily see that Case 2 is at least twice as fast as Case 1 due to the difference in distance traveled and time taken. And this math lines up almost perfectly with both the benchmark times and the difference in the number of assembly instructions.

The OP's Amended Question(s)

EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?

It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.

Regarding These Questions

As I have demonstrated without a doubt, there is an underlying issue even before the hardware and software become involved. Now as for the management of memory and caching along with page files, etc., which all work together in an integrated set of systems between: `The Architecture` { Hardware, Firmware, some Embedded Drivers, Kernels and ASM Instruction Sets }, `The OS` { File and Memory Management systems, Drivers and the Registry }, `The Compiler` { Translation Units and Optimizations of the Source Code }, and even the `Source Code` itself with its set(s) of distinctive algorithms: we can already see that there is a bottleneck happening within the first algorithm before we even apply it to any machine with any arbitrary `Architecture`, `OS`, and `Programmable Language`, compared to the second algorithm. So a problem already existed before involving the intrinsics of a modern computer.

The Ending Results

Now if the "Data" set is fairly small, it may not seem all that bad of a difference at first, but since `Case 1` is about `60 - 70%` slower than `Case 2`, we can look at the growth of this function in terms of the difference in execution times:

```
DeltaTimeDifference approximately = Loop1(time) - Loop2(time)
// where
Loop1(time) = Loop2(time) + (Loop2(time)*[0.6,0.7]) // approximately
// So when we substitute this back into the difference equation we end up with
DeltaTimeDifference approximately = (Loop2(time) + (Loop2(time)*[0.6,0.7])) - Loop2(time)
// And finally we can simplify this to
DeltaTimeDifference approximately = [0.6,0.7]*(Loop2(time))
```

And this approximation is the average difference between these two loops, both algorithmically and in machine operations involving software optimizations and machine instructions. So when the data set grows linearly, so does the difference in time between the two. Algorithm 1 has more fetches than Algorithm 2, which is evident from the `Boss` having to travel back and forth the maximum distance between `A` and `C` on every iteration after the first, while in Algorithm 2 the `Boss` travels to `A` once and then, after being done with `A`, travels the maximum distance only one time, going from `A` to `C`.

So trying to have the `Boss` focus on doing two similar things at once, juggling them back and forth instead of focusing on similar consecutive tasks, is going to make him quite angry by the end of the day, because he had to travel and work twice as much. Therefore, do not lose sight of the scope of the situation by letting your boss get into an interpolated bottleneck, because the boss's spouse and children wouldn't appreciate it.