Approximate cost to access various caches and main memory?

Can anyone give me the approximate time (in nanoseconds) to access the L1, L2 and L3 caches, as well as main memory, on Intel i7 processors?

While this isn't specifically a programming question, knowing these speed details is necessary for some low-latency programming challenges.


EDIT
Dave's second link provides the following numbers:

 Core i7 Xeon 5500 Series Data Source Latency (approximate)         [Pg. 22]

 local  L1 CACHE hit,                              ~4 cycles (  2.1 -  1.2 ns )
 local  L2 CACHE hit,                             ~10 cycles (  5.3 -  3.0 ns )
 local  L3 CACHE hit, line unshared               ~40 cycles ( 21.4 - 12.0 ns )
 local  L3 CACHE hit, shared line in another core ~65 cycles ( 34.8 - 19.5 ns )
 local  L3 CACHE hit, modified in another core    ~75 cycles ( 40.2 - 22.5 ns )

 remote L3 CACHE (Ref: Fig.1 [Pg. 5])        ~100-300 cycles ( 160.7 - 30.0 ns )

 local  DRAM                                                   ~60 ns
 remote DRAM                                                  ~100 ns

EDIT2
Most important of all is the notice attached to the quoted table, which says:

“Note: These values are rough approximations. They depend on core and uncore frequencies, memory speeds, BIOS settings, number of DIMMs, etc. YOUR MILEAGE MAY VARY.”

Here is a performance analysis guide for the i7 and Xeon range of processors. I should stress that it has what you need and more (for example, check page 22 for some timings and cycles).

Additionally, this page has some details on clock cycles, etc.

EDIT: I should stress that, as well as the timing/cycle information, the Intel document above covers many more (very) useful details of the i7 and Xeon range of processors (from a performance point of view).

Numbers everyone should know

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands

From: originally by Peter Norvig:
http://norvig.com/21-days.html#answers
http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/
http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine

A visual comparison

The access cost of the various memories, presented on one nice page

  • See this page presenting the decrease of memory latency from 1990 to 2020

Summary

  1. Values have decreased since 2005, but have stabilized

           1 ns        L1 cache
           3 ns        Branch mispredict
           4 ns        L2 cache
          17 ns        Mutex lock/unlock
         100 ns        Main memory (RAM)
       2 000 ns (2µs)  1KB Zippy-compress
  2. Still some improvements, prediction for 2020

      16 000 ns (16µs)  SSD random read (olibre's note: should be less)
     500 000 ns (½ms)   Round trip in datacenter
   2 000 000 ns (2ms)   HDD random read (seek)

See also other sources

  • What Every Programmer Should Know About Memory by Ulrich Drepper (2007)
    Dated, but still an excellent in-depth explanation of the interaction of memory hardware and software.
    • Full PDF (114 pages)
      • Comment on LWN about the PDF version
      • Another one
    • Seven posts + comments on LWN
      • Part 1 – Introduction
      • Part 2 – Caches
      • Part 3 – Virtual memory
      • Part 4 – NUMA support
      • Part 5 – What programmers can do
      • Part 6 – More things programmers can do
      • Part 7 – Memory performance tools
  • The Infinite Space Between Words published on codinghorror.com, based on the book Systems Performance: Enterprise and the Cloud
  • Click on each processor listed on http://www.7-cpu.com/ to see the L1 / L2 / L3 / RAM / … latencies (e.g. Haswell i7-4770: L1 = 1 ns, L2 = 3 ns, L3 = 10 ns, RAM = 67 ns, BranchMisprediction = 4 ns)
  • http://idarkside.org/posts/numbers-you-should-know/

See also training material

For further understanding, I recommend the excellent presentation of modern cache architectures (June 2014) by Gerhard Wellein, Hannes Hofmann and Dietmar Fey at the University of Erlangen-Nürnberg.

Just for the sake of a 2015 review of the predictions for 2020:

 Still some improvements, prediction for 2020 (Ref. olibre's answer below)
 -------------------------------------------------------------------------
    16 000 ns ( 16 µs) SSD random read (olibre's note: should be less)
   500 000 ns (  ½ ms) Round trip in datacenter
 2 000 000 ns (  2 ms) HDD random read (seek)

 In 2015 there are currently available:
 ========================================================================
       820 ns ( 0.8µs) random read from a SSD-DataPlane
     1 200 ns ( 1.2µs) Round trip in datacenter
     1 200 ns ( 1.2µs) random read from a HDD-DataPlane

Just for the sake of a CPU and GPU latency landscape comparison:

This is not an easy task, even for the simplest CPU / cache / DRAM lineup (even in a uniform memory access model), where DRAM speed is one factor in determining latency, and loaded latency (on a saturated system) is another, the latter being what enterprise applications will experience, rather than an idle, fully unloaded system.

              +----------------------------------- 5,6,7,8,9,..12,15,16
              |                                +--- 1066,1333,..2800..3300
              v                                v
 First  word = ( ( CAS latency * 2 ) + ( 1 - 1 ) ) / Data Rate
 Fourth word = ( ( CAS latency * 2 ) + ( 4 - 1 ) ) / Data Rate
 Eighth word = ( ( CAS latency * 2 ) + ( 8 - 1 ) ) / Data Rate
                                         ^----------------------- 7x .. difference
 ********************************
 So:
 ===
 resulting DDR3-side latencies are between  _____________
    3.03 ns    ^
               |
   36.58 ns ___v_ based on DDR3 HW facts

Uniform memory access

GPU engines have received a lot of technical marketing, while the deep internal dependencies are the key to understanding both the real strengths and the real weaknesses these architectures experience in practice (typically quite different from the aggressively marketed, over-hyped expectations).

   1 ns _________ LETS SETUP A TIME/DISTANCE SCALE FIRST:
                  a 1 ft-distance a foton travels in vacuum ( less in dark-fibre )
                  == a 1 ns TimeDOMAIN "distance", before a foton arrived
                  ( DATA RQST'd .. DATA RECV'd == DATA XFER/FETCH latency )

  25 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor REGISTER access
  35 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor    L1-onHit-[--8kB]CACHE
  70 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor SHARED-MEM access
 230 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor texL1-onHit-[--5kB]CACHE
 320 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor texL2-onHit-[256kB]CACHE
 350 ns
 700 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor GLOBAL-MEM access

Understanding the internalities is therefore much more important here than in other fields, where architectures are published and numerous benchmarks are freely available. Many thanks go to the GPU micro-benchmarkers, who have spent their time and creativity to unleash the truth about the real schemes of work inside the black-box-tested GPU devices.

 +====================|
 + 11-12 [usec] XFER-LATENCY-up   HostToDevice  ~~~ same as Intel X48 / nForce 790i
 + 10-11 [usec] XFER-LATENCY-down DeviceToHost
 ~  5.5 GB/sec  XFER-BW-up                      ~~~ same as DDR2/DDR3 throughput
 ~  5.2 GB/sec  XFER-BW-down @8192 KB TEST-LOAD
                ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
 |
 | Host-side
 |   cudaHostRegister( void *ptr, size_t size, unsigned int flags )
 |     +-- cudaHostRegisterPortable -- marks memory as PINNED MEMORY for all CUDA Contexts,
 |         not just the one, current, when the allocation was performed
 |         ___HostAllocWriteCombined_MEM / cudaHostFree()
 |     +-- cudaHostRegisterMapped   -- maps memory allocation into the CUDA address space
 |         ( the Device pointer can be obtained by a call to
 |           cudaHostGetDevicePointer( void **pDevice, void *pHost, unsigned int flags = 0 ); )
 |         ___HostRegisterPORTABLE___MEM / cudaHostUnregister( void *ptr )
 |
 | PCIe-2.0 ( 4x) | ~  4 GB/s over  4-Lanes ( PORT #2 )
 | PCIe-2.0 ( 8x) | ~ 16 GB/s over  8-Lanes
 | PCIe-2.0 (16x) | ~ 32 GB/s over 16-Lanes ( mode 16x )
 |
 | + PCIe-3.0 25-port 97-lanes non-blocking SwitchFabric ... +over copper/fiber
 |   ~~~ The latest PCIe specification, Gen 3, runs at 8 Gbps per serial lane, enabling a
 |       48-lane switch to handle a whopping 96 GBytes/sec. of full duplex peer-to-peer
 |       traffic. [I:]
 |
 | ~ 810 [ns] + InRam-"Network" / many-to-many parallel CPU/Memory "message" passing
 |             with less than 810 ns latency any-to-any
 |
 +====================|
 |.pci............HOST|


My apologies for the "bigger picture", but latency-demasking also has cardinal limits imposed by the on-chip smREG / L1 / L2 capacities and hit/miss rates.

 |.pci............GPU.|
 |
 | FERMI [GPU-CLK] ~ 0.9 [ns], but it is the I/O LATENCIES that matter:
 |
 |  <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
 |
 |  smREGs : +22 [GPU_CLKs] latency ( maskable by 22 WARPs ) on [REGISTER DEPENDENCY],
 |           when an arithmetic result is to be served from a previous [INSTR] [G]:10.4, Page-46
 |           ( about +11 [GPU_CLKs], maskable by 44 WARPs, on CC-3.0 [B]:5.2.3, Page-73 )
 |           SM-REGISTERs/thread: max 63 for CC-2.x / CC-3.0, max 128 for CC-1.x,
 |                                max 255 for CC-3.5
 |           <Compile-time>-designed spillover(s) to locMEM__ cost a penalty of
 |           +400 ~ +800 [GPU_CLKs] ( ~ +350 ~ +700 [ns] @ 1147 MHz FERMI )
 |           ( compare: ~ +5 [ns] @ 200 MHz FPGA . . . Xilinx/Zynq Z7020/FPGA
 |             massive-parallel streamline-computing mode, ev. PicoBlaze softCPU )
 |
 |  smREGs___BW ~ 8.0 TB/s [C:Pg.46] -- ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE
 |                                      << -Xptxas -v || nvcc -maxrregcount ( w|w/o spillover(s) )
 |  shaMEM___BW ~ 1.3 TB/s ( 4B * 32 banks * 15 SMs * half 1.4 GHz, only on FERMI )
 |  gloMEM___BW ~ 0.1 TB/s
 |
 |  on-chip smREG                 : + 22 [GPU_CLKs]
 |  conL1cache (   2 KB )         : +  8 [GPU_CLKs] on re-use only
 |  L1         (   8 KB )         : + 40 [GPU_CLKs] on re-use only
 |  L2         (  64 KB )         : + 80 [GPU_CLKs] on re-use from L1-cached (HIT)
 |                                  ( some +50 [GPU_CLKs] _IF_ a FETCH from yet-in-shaL2cache )
 |  shaMEM + conL3cache ( 32 KB ) : +220 [GPU_CLKs]
 |  texL1cache (   5 KB )         : +260 [GPU_CLKs]
 |  texL2cache ( 256 KB )         : +370 [GPU_CLKs]
 |  gloMEM PERSISTENT, DEVICE:0..3: +440 [GPU_CLKs]
 |
 |  ( threads t[0]..t[F] form half-WARPs / WARPs W0, W1, .. Wn per thread-block;
 |    SM:0 .. SM:D warpSchedulers issue <wait>-masked half-WARPs in RR-mode / GREEDY-mode )
 |_______________________________________________________________________________________


The bottom line?

Any low-latency motivated design has to reverse-engineer the "I/O hydraulics" (as 0/1-XFERs are incompressible by nature), and the resulting latencies rule the performance envelope of any GPGPU solution, be it computationally intensive (where the processing costs forgive a bit more of the poor-latency XFERs) or not (where, perhaps to someone's surprise, CPUs are faster in end-to-end processing than GPU fabrics [citations available]).