
Why discrepancy in theoretical vs STREAM-measured Nehalem memory bandwidths?

idata
Employee

Hello everyone,

My apologies for the cross-post; I think this is the relevant forum, but I had accidentally asked a similar question in the "open port IT" forums.

I have two questions about theoretical vs actual memory bandwidth performance of Nehalem processors. I would very much appreciate the guidance of someone who knows the architecture.

1) Why is the memory bandwidth measured by the STREAM benchmark (which is designed to exercise idealized streaming of data) not close to the theoretical memory interface bandwidth of Nehalem processors?

With a memory interface that has 3 channels of DDR3-1333 (1333 MT/s per channel), I understand that the theoretical memory bandwidth should be:

BW_theoretical = 1333 megatransfers/s per channel (DDR rate already included) * 3 channels * 64-bit bus width / (8 bits/byte) = 32,000 MB/s ≈ 32 GB/s per socket
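(As a trivial sanity check on that arithmetic, here are a few lines of C that just recompute the number above; every constant is the one quoted, nothing is measured.)

    #include <stdio.h>

    /* Recompute the theoretical peak quoted above:
       3 channels of DDR3-1333, each 64 bits (8 bytes) wide. */
    int main(void)
    {
        const double transfers_per_sec = 1333e6; /* MT/s, DDR rate already included */
        const int channels = 3;
        const int bytes_per_transfer = 8;        /* 64-bit channel */
        double peak = transfers_per_sec * channels * bytes_per_transfer;
        printf("theoretical peak: %.1f GB/s per socket\n", peak / 1e9); /* prints 32.0 */
        return 0;
    }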

The above is also what I see on ark.intel.com (http://ark.intel.com/Product.aspx?id=37106), and on some non-Intel sites, so it sounds right.

On the other hand, it is commonly reported that Nehalem-based Xeon 5500 series processors (X5550, E5520, etc.) get about 15-17 GB/s per socket, or up to about 37 GB/s in a dual-socket configuration (e.g. http://www.advancedclustering.com/company-blog/stream-benchmarking.html).

Why the difference? What determines the achievable memory bandwidth in an idealized situation, where each CPU has its data in local memory and is streaming a continuous stream of data from or to that memory?
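To be concrete about what I mean by "streaming", below is a rough sketch in C of the kind of kernel STREAM times (its "triad", a[i] = b[i] + s*c[i]). It is not the official STREAM source (that lives at http://www.cs.virginia.edu/stream/); the array size, single repetition, and timing here are simplifications of mine, just to illustrate the access pattern and how the GB/s figure is derived.

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (20 * 1000 * 1000)   /* ~160 MB per array, far larger than the caches */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;

        /* Initialize (and first-touch) the arrays. */
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        #pragma omp parallel for     /* one thread per core if built with -fopenmp */
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* Triad touches 3 arrays x 8 bytes per element (2 reads + 1 write);
           any extra write-allocate traffic for a[] is not counted, as in STREAM. */
        double gbs = 3.0 * N * sizeof(double) / secs / 1e9;
        printf("triad: %.2f GB/s\n", gbs);

        free(a); free(b); free(c);
        return 0;
    }

As far as I understand, the 15-17 GB/s per-socket numbers come from loops of essentially this shape, run with one thread per core and each thread's memory allocated on its local socket.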

2) Does the memory bandwidth depend on the CPU clock frequency (e.g. 2.66 vs 2.93 vs 3.33 GHz), and if so, why, given that the memory bus is always fixed at 1333 MT/s? (It appears that it does, but I wasn't able to find a test that shows the dependence clearly.) And is there a simple way to calculate memory bandwidth as a function of CPU speed, the way I calculated it above without the CPU taken into account?

Thank you for your help!

Milos Popovic

3 Replies
idata
Employee

1) I think 32 GB/s is a theoretical maximum; in reality there is other traffic on the bus that eats into the amount of data you actually see. If you think about running STREAM on Windows, there is a ton of other stuff going on at the same time that interferes with the system delivering the theoretical result.

2) The low-end chips have an 800 MHz RAM bus, the mid-range ones 1066, and the high-end ones 1333, so I assume you see that in the benchmarks?

idata
Employee

 

Ytterbium wrote:

1) I think 32 GB/s is a theoretical maximum; in reality there is other traffic on the bus that eats into the amount of data you actually see. If you think about running STREAM on Windows, there is a ton of other stuff going on at the same time that interferes with the system delivering the theoretical result.

Indeed, 32 GB/s is what I compute for the theoretical throughput of the memory interface. However, when a high-performance computing application is streaming data from memory, there should be very little else happening in the OS -- certainly not enough to cause a 50% drop (from 32 GB/s to the measured 17-18 GB/s or so).

2) The low-end chips have an 800 MHz RAM bus, the mid-range ones 1066, and the high-end ones 1333, so I assume you see that in the benchmarks?

Yes, the benchmarks show a small drop at the lower bus frequencies, but even 1333 MT/s doesn't get to 32 GB/s. See here:

http://www.advancedclustering.com/company-blog/stream-benchmarking.html
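For reference, plugging those three bus speeds into the same per-channel arithmetic as in my first post gives theoretical peaks of:

    BW(800)  =  800 MT/s * 3 channels * 8 bytes = 19.2 GB/s
    BW(1066) = 1066 MT/s * 3 channels * 8 bytes = 25.6 GB/s
    BW(1333) = 1333 MT/s * 3 channels * 8 bytes = 32.0 GB/s

which the measured numbers in that post can be compared against.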

I was told by a few people that it could come down to several things inside the CPU itself, i.e. that the CPU can't generate memory requests fast enough, or that the DRAM can't keep the memory bus full even when streaming, because of various latencies. We concluded that the second is probably not the case, since accesses to the different DIMM banks are interleaved during streaming to hide those latencies, so the bus should stay full.

Another possibility was that reading and writing at the same time is the problem, but tests of only reading and only writing (of the kind sketched below) also give throughput similar to the other benchmarks, so they don't explain the problem either.

Finally, dual-socket benchmarks give about 37 GB/s (vs. a theoretical 2 x 32 = 64 GB/s) at 1333 MT/s, and that part could be due to cache coherence: one CPU having to check the other's cache before going to memory. Nehalem does not appear to have a probe filter ("snoop filter") to hide that latency the way AMD's Istanbul does, which may explain part of the slowdown. However, even a single-socket Intel CPU only gets around 18 GB/s, so it still isn't explained.
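For completeness, the read-only and write-only tests I'm referring to are essentially of the following shape (an illustrative sketch, not the exact code that was run; the array size and the lack of threading are simplifications):

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (50 * 1000 * 1000)    /* ~400 MB, far larger than the caches */

    static double elapsed(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        double *x = malloc(N * sizeof *x);
        if (!x) return 1;
        for (long i = 0; i < N; i++) x[i] = 1.0;

        struct timespec t0, t1;
        volatile double sink;

        /* Read-only test: sum the array (reads N*8 bytes). */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        double sum = 0.0;
        for (long i = 0; i < N; i++) sum += x[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = sum;  /* keep the compiler from dropping the loop */
        printf("read : %.2f GB/s\n", N * sizeof(double) / elapsed(t0, t1) / 1e9);

        /* Write-only test: overwrite the array (writes N*8 bytes; ordinary
           stores may also read each line in first, i.e. write-allocate, so
           the actual bus traffic can be higher than the bytes written). */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++) x[i] = 2.0;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("write: %.2f GB/s\n", N * sizeof(double) / elapsed(t0, t1) / 1e9);

        (void)sink;
        free(x);
        return 0;
    }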

I was hoping that someone who understands Nehalem and/or AMD architecture could explain exactly what's going on -- and whether it's the CPU/memory controller, or the DRAM modules, that are limiting the bandwidth, and why.

Milos

idata
Employee

I think your best bet is to jump on a blog post from someone at Intel and see if you can get an answer.
