How to maximize DDR3 memory data transfer rate?

Tags:

I am trying to measure DDR3 memory data transfer rate through a test. According to the CPU spec. maximum theoritical bandwidth is 51.2 GB/s. This should be the combined bandwidth of four channels, meaning 12.8 GB/channel. However, this is a theoretical limit and I am curious of how to further increase the practical limit in this post. In the below described test scenario I achieve a ~14 GB/s data transfer rate which I believe may be a close approximation when killing most of the throuhgput boost of the CPU L1, L2, and L3 caches.

Update 20/3 2014: This assumption of killing the L1-L3 caches is wrong. The harware prefetching of the memory controller will analyze the data accesses pattern and since it sequential, it will have an easy task of prefetching data into the CPU caches.

Specific questions follow at the bottom but mainly I am interested in a) a verifications of the assumptions leading up to this result, and b) if there is a better way measuring memory bandwith in .NET.

I have constructed a test in C# on .NET as a starter. Although .NET is not ideal from a memory allocation perspective, I think it is doable for this test (please let me know if you disagree and why). The test is to allocate an int64 array and fill it with integers. This array should have data aligned in memory. Then I simply loop this array using as many threads as I have cores on the machine and read the int64 value from the array and set it to a local public field in the test class. Since the result field is public, I should avoid compiler optimising away stuff in the loop. Futhermore, and this may be a weak assumption, I think the result stays in the register and not written to memory until it is over written again. Between each read of an element in the array I use an variable Step offset of 10, 100, and 1000 in the array in order to not be able to fetch many references in the same cache block (64 byte).

Reading the Int64 from the array should mean a lookup read of 8 bytes and then the read of the actual value another 8 byte. Since data is fetched from memory in 64 byte cache line, each read in the array should correspond to a 64 byte read from RAM each time in the loop given that the read data is not located in any CPU caches.

Here is how I initiallize the data array:

_longArray = new long[Config.NbrOfCores][];
for (int threadId = 0; threadId < Config.NbrOfCores; threadId++)
{
    _longArray[threadId] = new long[Config.NmbrOfRequests];
    for (int i = 0; i < Config.NmbrOfRequests; i++)
        _longArray[threadId][i] = i;
}

And here is the actual test:

GC.Collect();
timer.Start();
Parallel.For(0, Config.NbrOfCores, threadId =>
{
    var intArrayPerThread = _longArray[threadId];
    for (int redo = 0; redo < Config.NbrOfRedos; redo++)
        for (long i = 0; i < Config.NmbrOfRequests; i += Config.Step) 
            _result = intArrayPerThread[i];                        
});
timer.Stop();

Since the data summary is quite important for the result I give this info too (can be skipped if you trust me...)

var timetakenInSec = timer.ElapsedMilliseconds / (double)1000;
long totalNbrOfRequest = Config.NmbrOfRequests / Config.Step * Config.NbrOfCores*Config.NbrOfRedos; 
var throughput_ReqPerSec = totalNbrOfRequest / timetakenInSec;
var throughput_BytesPerSec = throughput_ReqPerSec * byteSizePerRequest;
var timeTakenPerRequestInNanos = Math.Round(1e6 * timer.ElapsedMilliseconds / totalNbrOfRequest, 1);
var resultMReqPerSec = Math.Round(throughput_ReqPerSec/1e6, 1);
var resultGBPerSec = Math.Round(throughput_BytesPerSec/1073741824, 1);
var resultTimeTakenInSec = Math.Round(timetakenInSec, 1);

Neglecting to give you the actual output rendering code I get the following result:

Step   10: Throughput:   570,3 MReq/s and         34 GB/s (64B),   Timetaken/request:      1,8 ns/req, Total TimeTaken: 12624 msec, Total Requests:   7 200 000 000
Step  100: Throughput:   462,0 MReq/s and       27,5 GB/s (64B),   Timetaken/request:      2,2 ns/req, Total TimeTaken: 15586 msec, Total Requests:   7 200 000 000
Step 1000: Throughput:   236,6 MReq/s and       14,1 GB/s (64B),   Timetaken/request:      4,2 ns/req, Total TimeTaken: 30430 msec, Total Requests:   7 200 000 000

Using 12 threads instead of 6 (since the CPU is hyper threaded) I get pretty much the same throughput (as expected I think): 32.9 / 30.2 / 15.5 GB/s .

As can be seen, throughput drops as the step increases which I think is normal. Partly I think it is due to that the 12 MB L3 cache forces mores cache misses and partly it may be the Memory Controllers prefetch mechanism that is not working as well when the reads are so far apart. I further believe that the step 1000 result is the closest one to the actual practical memory speed since it should kill most of the CPU caches and "hopefully" kill the prefetch mechanism. Futher more I am assuming that most of the overhead in this loop is the memory fetch operation and not something else.

hardware for this test is: Intel Core I7-3930 (specs: CPU breif, more detailed, and really detailed spec ) using 32 GB total of DDR3-1600 memories.

Open questions

Am I correct in the assumptions made above?
Is there a way to increase the use of the memory bandwidth? For instance by doing it in C/C++ instead and spread out memory allocation more on heap enabling all four memory channels to be used.
Is there a better way to measure the memory data transfer?

Much obliged for input on this. I know it is a complex area under the hood...

All code here is available for download at https://github.com/Toby999/ThroughputTest. Feel free to contact me at an forwarding email tobytemporary[at]gmail.com.

611

asked Dec 12 '13 20:12

Toby999

2 Answers

The decrease in throughput as you increase step is likely caused by the memory prefetching not working well anymore if you don't stride linearly through memory.

Things you can do to improve the speed:

The test speed will be artificially bound by the loop itself taking up CPU cycles. As Roy shows, more speed can be achieved by unfolding the loop.
You should get rid of boundary checking (with "unchecked")
Instead of using Parallel.For, use Thread.Start and pin each thread you start on a separate core (using the code from here: Set thread processor affinity in Microsoft .Net)
Make sure all threads start at the same time, so you don't measure any stragglers (you can do this by spinning on a memory address that you Interlock.Exchange to a new value when all threads are running and spinning)
On a NUMA machine (for example a 2 Socket Modern Xeon), you may have to take extra steps to allocate memory on the NUMA node that a thread will live on. To do this, you need to PInvoke VirtualAllocExNuma
Speaking of memory allocations, using Large Pages should provide yet another boost

While .NET isn't the easiest framework to use for this type of testing, it IS possible to coax it into doing what you want.

175

answered Oct 02 '22 22:10

Thomas Kejser

Reported RAM results (128 MB) for my bus8thread64.exe benchmark on an i7 3820 with max memory bandwidth of 51.2 GB/s, vary from 15.6 with 1 thread, 28.1 with 2 threads to 38.7 at 8 threads. Code is:

   void inc1word(IDEF data1[], IDEF ands[], int n)
    {
       int i, j;

       for(j=0; j<passes1; j++)
       {
           for (i=0; i<wordsToTest; i=i+64)
           {
               ands[n] = ands[n] & data1[i   ] & data1[i+1 ] & data1[i+2 ] & data1[i+3 ]
                                 & data1[i+4 ] & data1[i+5 ] & data1[i+6 ] & data1[i+7 ]
                                 & data1[i+8 ] & data1[i+9 ] & data1[i+10] & data1[i+11]
                                 & data1[i+12] & data1[i+13] & data1[i+14] & data1[i+15]
                                 & data1[i+16] & data1[i+17] & data1[i+18] & data1[i+19]
                                 & data1[i+20] & data1[i+21] & data1[i+22] & data1[i+23]
                                 & data1[i+24] & data1[i+25] & data1[i+26] & data1[i+27]
                                 & data1[i+28] & data1[i+29] & data1[i+30] & data1[i+31]
                                 & data1[i+32] & data1[i+33] & data1[i+34] & data1[i+35]
                                 & data1[i+36] & data1[i+37] & data1[i+38] & data1[i+39]
                                 & data1[i+40] & data1[i+41] & data1[i+42] & data1[i+43]
                                 & data1[i+44] & data1[i+45] & data1[i+46] & data1[i+47]
                                 & data1[i+48] & data1[i+49] & data1[i+50] & data1[i+51]
                                 & data1[i+52] & data1[i+53] & data1[i+54] & data1[i+55]
                                 & data1[i+56] & data1[i+57] & data1[i+58] & data1[i+59]
                                 & data1[i+60] & data1[i+61] & data1[i+62] & data1[i+63];
           }
        }
    }

This also measures burst reading speeds, where max DTR, based on this, is 46.9 GB/s. Benchmark and source code are in:

http://www.roylongbottom.org.uk/quadcore.zip

For results with interesting speeds using L3 caches are in:

http://www.roylongbottom.org.uk/busspd2k%20results.htm#anchor8Thread

answered Oct 02 '22 21:10

Roy Longbottom

Related questions
                            
                                In C#, is it possible to implement an interface member using a member with a different name, like you can do in VB.NET?
                            
                                how does a c# profiler work?
                            
                                Adding styles and scripts to ASP.NET web controls (ascx) without repeating inclusion directives
                            
                                .NET: ThreadStatic vs lock { }. Why ThreadStaticAttribute degrades performance?
                            
                                serial communication, read the 9th bit
                            
                                Request.QueryString[] vs. Request.Query.Get() vs. HttpUtility.ParseQueryString()
                            
                                Is it OK to pass a stream around to multiple methods? [closed]
                            
                                Converting an extension method group to a delegate with a generic type
                            
                                Why would I need to check for greater than Int32.MaxValue?
                            
                                How do I path relative CSS paths correctly when using Visual Studio 2012 Bundling? [duplicate]
                            
                                Assign a method with default values to Func<> without those parameters?
                            
                                Inconsistent multiplication performance with floats
                            
                                Using ILMerge with log4net is causing "inaccessible due to protection level" error
                            
                                The predefined type 'System.Threading.Tasks.Task' is defined in multiple assemblies in the global alias
                            
                                "nested if" versus "if and" performance using F#
                            
                                Overloading base method in derived class
                            
                                Create an avatar upload form for users
                            
                                How does preCondition="managedHandler" work for modules?
                            
                                C# huge performance drop assigning float value
                            
                                How to resolve Web API Message Handler / DelegatingHandler from IoC container on each request

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to maximize DDR3 memory data transfer rate?

Tags:

memory-management

c#

.net

memory

ram

Toby999

People also ask

2 Answers

Thomas Kejser

Roy Longbottom

Recent Activity

Donate For Us