I am trying to measure DDR3 memory data transfer rate through a test. According to the CPU spec. maximum theoritical bandwidth is 51.2 GB/s. This should be the combined bandwidth of four channels, meaning 12.8 GB/channel. However, this is a theoretical limit and I am curious of how to further increase the practical limit in this post. In the below described test scenario I achieve a ~14 GB/s data transfer rate which I believe may be a close approximation when killing most of the throuhgput boost of the CPU L1, L2, and L3 caches.
Update 20/3 2014: This assumption of killing the L1-L3 caches is wrong. The harware prefetching of the memory controller will analyze the data accesses pattern and since it sequential, it will have an easy task of prefetching data into the CPU caches.
Specific questions follow at the bottom but mainly I am interested in a) a verifications of the assumptions leading up to this result, and b) if there is a better way measuring memory bandwith in .NET.
I have constructed a test in C# on .NET as a starter. Although .NET is not ideal from a memory allocation perspective, I think it is doable for this test (please let me know if you disagree and why). The test is to allocate an int64 array and fill it with integers. This array should have data aligned in memory. Then I simply loop this array using as many threads as I have cores on the machine and read the int64 value from the array and set it to a local public field in the test class. Since the result field is public, I should avoid compiler optimising away stuff in the loop. Futhermore, and this may be a weak assumption, I think the result stays in the register and not written to memory until it is over written again. Between each read of an element in the array I use an variable Step offset of 10, 100, and 1000 in the array in order to not be able to fetch many references in the same cache block (64 byte).
Reading the Int64 from the array should mean a lookup read of 8 bytes and then the read of the actual value another 8 byte. Since data is fetched from memory in 64 byte cache line, each read in the array should correspond to a 64 byte read from RAM each time in the loop given that the read data is not located in any CPU caches.
Here is how I initiallize the data array:
_longArray = new long[Config.NbrOfCores][];
for (int threadId = 0; threadId < Config.NbrOfCores; threadId++)
{
_longArray[threadId] = new long[Config.NmbrOfRequests];
for (int i = 0; i < Config.NmbrOfRequests; i++)
_longArray[threadId][i] = i;
}
And here is the actual test:
GC.Collect();
timer.Start();
Parallel.For(0, Config.NbrOfCores, threadId =>
{
var intArrayPerThread = _longArray[threadId];
for (int redo = 0; redo < Config.NbrOfRedos; redo++)
for (long i = 0; i < Config.NmbrOfRequests; i += Config.Step)
_result = intArrayPerThread[i];
});
timer.Stop();
Since the data summary is quite important for the result I give this info too (can be skipped if you trust me...)
var timetakenInSec = timer.ElapsedMilliseconds / (double)1000;
long totalNbrOfRequest = Config.NmbrOfRequests / Config.Step * Config.NbrOfCores*Config.NbrOfRedos;
var throughput_ReqPerSec = totalNbrOfRequest / timetakenInSec;
var throughput_BytesPerSec = throughput_ReqPerSec * byteSizePerRequest;
var timeTakenPerRequestInNanos = Math.Round(1e6 * timer.ElapsedMilliseconds / totalNbrOfRequest, 1);
var resultMReqPerSec = Math.Round(throughput_ReqPerSec/1e6, 1);
var resultGBPerSec = Math.Round(throughput_BytesPerSec/1073741824, 1);
var resultTimeTakenInSec = Math.Round(timetakenInSec, 1);
Neglecting to give you the actual output rendering code I get the following result:
Step 10: Throughput: 570,3 MReq/s and 34 GB/s (64B), Timetaken/request: 1,8 ns/req, Total TimeTaken: 12624 msec, Total Requests: 7 200 000 000
Step 100: Throughput: 462,0 MReq/s and 27,5 GB/s (64B), Timetaken/request: 2,2 ns/req, Total TimeTaken: 15586 msec, Total Requests: 7 200 000 000
Step 1000: Throughput: 236,6 MReq/s and 14,1 GB/s (64B), Timetaken/request: 4,2 ns/req, Total TimeTaken: 30430 msec, Total Requests: 7 200 000 000
Using 12 threads instead of 6 (since the CPU is hyper threaded) I get pretty much the same throughput (as expected I think): 32.9 / 30.2 / 15.5 GB/s .
As can be seen, throughput drops as the step increases which I think is normal. Partly I think it is due to that the 12 MB L3 cache forces mores cache misses and partly it may be the Memory Controllers prefetch mechanism that is not working as well when the reads are so far apart. I further believe that the step 1000 result is the closest one to the actual practical memory speed since it should kill most of the CPU caches and "hopefully" kill the prefetch mechanism. Futher more I am assuming that most of the overhead in this loop is the memory fetch operation and not something else.
hardware for this test is: Intel Core I7-3930 (specs: CPU breif, more detailed, and really detailed spec ) using 32 GB total of DDR3-1600 memories.
Open questions
Am I correct in the assumptions made above?
Is there a way to increase the use of the memory bandwidth? For instance by doing it in C/C++ instead and spread out memory allocation more on heap enabling all four memory channels to be used.
Is there a better way to measure the memory data transfer?
Much obliged for input on this. I know it is a complex area under the hood...
All code here is available for download at https://github.com/Toby999/ThroughputTest. Feel free to contact me at an forwarding email tobytemporary[at]gmail.com.
DDR3 modules can transfer data at a rate of 800–2133 MT/s using both rising and falling edges of a 400–1066 MHz I/O clock. This is twice DDR2's data transfer rates (400–1066 MT/s using a 200–533 MHz I/O clock) and four times the rate of DDR (200–400 MT/s using a 100–200 MHz I/O clock).
DDR3-1600 memory has a module classification of PC3-12800, which effectively means the peak data rate of the module is 12.8GB/sec (see table). That's about a 17% improvement on memory bandwidth over DDR3-1333.
Memory is designed to be backward compatible within its generation, so generally speaking, you can safely add faster memory to a computer that was designed to run slower memory. However, your system will operate at the speed of the slowest memory module installed.
The decrease in throughput as you increase step is likely caused by the memory prefetching not working well anymore if you don't stride linearly through memory.
Things you can do to improve the speed:
Parallel.For
, use Thread.Start
and pin each thread you start on a separate core (using the code from here: Set thread processor affinity in Microsoft .Net)Interlock.Exchange
to a new value when all threads are running and spinning)VirtualAllocExNuma
While .NET isn't the easiest framework to use for this type of testing, it IS possible to coax it into doing what you want.
Reported RAM results (128 MB) for my bus8thread64.exe benchmark on an i7 3820 with max memory bandwidth of 51.2 GB/s, vary from 15.6 with 1 thread, 28.1 with 2 threads to 38.7 at 8 threads. Code is:
void inc1word(IDEF data1[], IDEF ands[], int n)
{
int i, j;
for(j=0; j<passes1; j++)
{
for (i=0; i<wordsToTest; i=i+64)
{
ands[n] = ands[n] & data1[i ] & data1[i+1 ] & data1[i+2 ] & data1[i+3 ]
& data1[i+4 ] & data1[i+5 ] & data1[i+6 ] & data1[i+7 ]
& data1[i+8 ] & data1[i+9 ] & data1[i+10] & data1[i+11]
& data1[i+12] & data1[i+13] & data1[i+14] & data1[i+15]
& data1[i+16] & data1[i+17] & data1[i+18] & data1[i+19]
& data1[i+20] & data1[i+21] & data1[i+22] & data1[i+23]
& data1[i+24] & data1[i+25] & data1[i+26] & data1[i+27]
& data1[i+28] & data1[i+29] & data1[i+30] & data1[i+31]
& data1[i+32] & data1[i+33] & data1[i+34] & data1[i+35]
& data1[i+36] & data1[i+37] & data1[i+38] & data1[i+39]
& data1[i+40] & data1[i+41] & data1[i+42] & data1[i+43]
& data1[i+44] & data1[i+45] & data1[i+46] & data1[i+47]
& data1[i+48] & data1[i+49] & data1[i+50] & data1[i+51]
& data1[i+52] & data1[i+53] & data1[i+54] & data1[i+55]
& data1[i+56] & data1[i+57] & data1[i+58] & data1[i+59]
& data1[i+60] & data1[i+61] & data1[i+62] & data1[i+63];
}
}
}
This also measures burst reading speeds, where max DTR, based on this, is 46.9 GB/s. Benchmark and source code are in:
http://www.roylongbottom.org.uk/quadcore.zip
For results with interesting speeds using L3 caches are in:
http://www.roylongbottom.org.uk/busspd2k%20results.htm#anchor8Thread
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With