为什么std :: pair比std :: tuple快

这里是testing的代码。

元组testing:

using namespace std; int main(){ vector<tuple<int,int>> v; for (int var = 0; var < 100000000; ++var) { v.push_back(make_tuple(var, var)); } } 

配对testing:

 #include <vector> using namespace std; int main(){ vector<pair<int,int>> v; for (int var = 0; var < 100000000; ++var) { v.push_back(make_pair(var, var)); } } 

我通过Linux时间命令做了时间测量。 结果是:

 | | -O0 | -O2 | |:------|:-------:|:--------:| | Pair | 8.9 s | 1.60 s | | Tuple | 19.8 s | 1.96 s | 

我想知道,为什么O0这两个数据结构之间有这么大的差别,因为它们应该非常相似。 02有一个小小的差别。

为什么O0的差距那么大,为什么还有差别呢?

编辑:

代码与v.resize()

对:

 #include <vector> using namespace std; int main(){ vector<pair<int,int>> v; v.resize(100000000); for (int var = 0; var < 100000000; ++var) { v[var] = make_pair(var, var); } } 

元组:

 #include<tuple> #include<vector> using namespace std; int main(){ vector<tuple<int,int>> v; v.resize(100000000); for (int var = 0; var < 100000000; ++var) { v[var] = make_tuple(var, var); } } 

结果:

 | | -O0 | -O2 | |:------|:-------:|:--------:| | Pair | 5.01 s | 0.77 s | | Tuple | 10.6 s | 0.87 s | 

编辑:

我的系统

 g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7) GLIBCXX_3.4.19 

你错过了一些重要的信息:你使用什么编译器? 你用什么来衡量microbenchmark的performance? 你使用什么标准的库实现?

我的系统:

 g++ (GCC) 4.9.1 20140903 (prerelease) GLIBCXX_3.4.20 

无论如何,我运行你的例子,但保留了vector的大小,以摆脱内存分配的开销。 有了这个,我可以有趣地观察相反的东西 – 与你所看到的相反:

 g++ -std=c++11 -O2 pair.cpp -o pair perf stat -r 10 -d ./pair Performance counter stats for './pair' (10 runs): 1647.045151 task-clock:HG (msec) # 0.993 CPUs utilized ( +- 1.94% ) 346 context-switches:HG # 0.210 K/sec ( +- 40.13% ) 7 cpu-migrations:HG # 0.004 K/sec ( +- 22.01% ) 182,978 page-faults:HG # 0.111 M/sec ( +- 0.04% ) 3,394,685,602 cycles:HG # 2.061 GHz ( +- 2.24% ) [44.38%] 2,478,474,676 stalled-cycles-frontend:HG # 73.01% frontend cycles idle ( +- 1.24% ) [44.55%] 1,550,747,174 stalled-cycles-backend:HG # 45.68% backend cycles idle ( +- 1.60% ) [44.66%] 2,837,484,461 instructions:HG # 0.84 insns per cycle # 0.87 stalled cycles per insn ( +- 4.86% ) [55.78%] 526,077,681 branches:HG # 319.407 M/sec ( +- 4.52% ) [55.82%] 829,623 branch-misses:HG # 0.16% of all branches ( +- 4.42% ) [55.74%] 594,396,822 L1-dcache-loads:HG # 360.887 M/sec ( +- 4.74% ) [55.59%] 20,842,113 L1-dcache-load-misses:HG # 3.51% of all L1-dcache hits ( +- 0.68% ) [55.46%] 5,474,166 LLC-loads:HG # 3.324 M/sec ( +- 1.81% ) [44.23%] <not supported> LLC-load-misses:HG 1.658671368 seconds time elapsed ( +- 1.82% ) 

与:

 g++ -std=c++11 -O2 tuple.cpp -o tuple perf stat -r 10 -d ./tuple Performance counter stats for './tuple' (10 runs): 996.090514 task-clock:HG (msec) # 0.996 CPUs utilized ( +- 2.41% ) 102 context-switches:HG # 0.102 K/sec ( +- 64.61% ) 4 cpu-migrations:HG # 0.004 K/sec ( +- 32.24% ) 181,701 page-faults:HG # 0.182 M/sec ( +- 0.06% ) 2,052,505,223 cycles:HG # 2.061 GHz ( +- 2.22% ) [44.45%] 1,212,930,513 stalled-cycles-frontend:HG # 59.10% frontend cycles idle ( +- 2.94% ) [44.56%] 621,104,447 stalled-cycles-backend:HG # 30.26% backend cycles idle ( +- 3.48% ) [44.69%] 2,700,410,991 instructions:HG # 1.32 insns per cycle # 0.45 stalled cycles per insn ( +- 1.66% ) [55.94%] 486,476,408 branches:HG # 488.386 M/sec ( +- 1.70% ) [55.96%] 959,651 branch-misses:HG # 0.20% of all branches ( +- 4.78% ) [55.82%] 547,000,119 L1-dcache-loads:HG # 549.147 M/sec ( +- 2.19% ) [55.67%] 21,540,926 L1-dcache-load-misses:HG # 3.94% of all L1-dcache hits ( +- 2.73% ) [55.43%] 5,751,650 LLC-loads:HG # 5.774 M/sec ( +- 3.60% ) [44.21%] <not supported> LLC-load-misses:HG 1.000126894 seconds time elapsed ( +- 2.47% ) 

正如你所看到的,就我而言,原因是在前端以及在后端都有更多的停顿周期。

现在从哪里来? 我敢打赌,它归结为一些失败的内联,类似于这里解释: 启用C ++ 11时,std :: vector的性能回归

事实上,启用-flto为我平衡的结果:

 Performance counter stats for './pair' (10 runs): 1021.922944 task-clock:HG (msec) # 0.997 CPUs utilized ( +- 1.15% ) 63 context-switches:HG # 0.062 K/sec ( +- 77.23% ) 5 cpu-migrations:HG # 0.005 K/sec ( +- 34.21% ) 195,396 page-faults:HG # 0.191 M/sec ( +- 0.00% ) 2,109,877,147 cycles:HG # 2.065 GHz ( +- 0.92% ) [44.33%] 1,098,031,078 stalled-cycles-frontend:HG # 52.04% frontend cycles idle ( +- 0.93% ) [44.46%] 701,553,535 stalled-cycles-backend:HG # 33.25% backend cycles idle ( +- 1.09% ) [44.68%] 3,288,420,630 instructions:HG # 1.56 insns per cycle # 0.33 stalled cycles per insn ( +- 0.88% ) [55.89%] 672,941,736 branches:HG # 658.505 M/sec ( +- 0.80% ) [56.00%] 660,278 branch-misses:HG # 0.10% of all branches ( +- 2.05% ) [55.93%] 474,314,267 L1-dcache-loads:HG # 464.139 M/sec ( +- 1.32% ) [55.73%] 19,481,787 L1-dcache-load-misses:HG # 4.11% of all L1-dcache hits ( +- 0.80% ) [55.51%] 5,155,678 LLC-loads:HG # 5.045 M/sec ( +- 1.69% ) [44.21%] <not supported> LLC-load-misses:HG 1.025083895 seconds time elapsed ( +- 1.03% ) 

和元组:

 Performance counter stats for './tuple' (10 runs): 1018.980969 task-clock:HG (msec) # 0.999 CPUs utilized ( +- 0.47% ) 8 context-switches:HG # 0.008 K/sec ( +- 29.74% ) 3 cpu-migrations:HG # 0.003 K/sec ( +- 42.64% ) 195,396 page-faults:HG # 0.192 M/sec ( +- 0.00% ) 2,103,574,740 cycles:HG # 2.064 GHz ( +- 0.30% ) [44.28%] 1,088,827,212 stalled-cycles-frontend:HG # 51.76% frontend cycles idle ( +- 0.47% ) [44.56%] 697,438,071 stalled-cycles-backend:HG # 33.15% backend cycles idle ( +- 0.41% ) [44.76%] 3,305,631,646 instructions:HG # 1.57 insns per cycle # 0.33 stalled cycles per insn ( +- 0.21% ) [55.94%] 675,175,757 branches:HG # 662.599 M/sec ( +- 0.16% ) [56.02%] 656,205 branch-misses:HG # 0.10% of all branches ( +- 0.98% ) [55.93%] 475,532,976 L1-dcache-loads:HG # 466.675 M/sec ( +- 0.13% ) [55.69%] 19,430,992 L1-dcache-load-misses:HG # 4.09% of all L1-dcache hits ( +- 0.20% ) [55.49%] 5,161,624 LLC-loads:HG # 5.065 M/sec ( +- 0.47% ) [44.14%] <not supported> LLC-load-misses:HG 1.020225388 seconds time elapsed ( +- 0.48% ) 

所以请记住, -flto是你的朋友,失败的内联可以在严重的模板代码上产生极端的结果。 使用perf stat来找出发生的事情。

milianw没有解决-O0-O2 ,所以我想补充说明一下。

完全可以预料, std::tuple优化时会比std::pair慢,因为它是比较复杂的对象。 一对正好有两个成员,所以它的方法可以直接定义。 但是元组有任意数量的成员,遍历模板参数列表的唯一方法是recursion。 因此,元组中的大部分函数处理一个成员,然后recursion处理其余部分,所以对于2元组,您有两倍的函数调用。

现在,当它们优化时,编译器将内联该recursion,并且不应该有显着差异。 哪些testing明确证实。 这一般适用于大量模板化的东西。 可以编写模板来提供没有运行开销或运行时开销很小的抽象,但是依赖于优化来embedded所有微不足道的函数。