I have a recent 12-core Intel CPU (Haswell architecture) with 4 memory channels. How many DRAM accesses can the machine do in parallel?
For example, suppose I have a program with 12 threads, each sitting in a tight loop that reads a single byte from a random memory address in a range too large to fit in cache. I expect all 12 threads to spend almost all of their time waiting for the memory fetch.
Do the threads have to take it in turns to use the DRAM bus?
NOTE: Assume I'm using a 1 GB virtual-memory page size, so there are no TLB misses.
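To make the setup concrete, this is roughly the kind of program I mean (a minimal pthreads sketch; the xorshift RNG, iteration count and 4 GB table are placeholders, not a tuned benchmark):

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_THREADS 12
#define TABLE_SIZE  (4ULL * 1024 * 1024 * 1024)  // much larger than cache

static uint8_t *table;

static void *reader(void *arg) {
    uint64_t x = (uintptr_t)arg;  // per-thread RNG seed, never zero
    uint64_t sum = 0;
    for (long i = 0; i < 100000000L; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;  // xorshift64
        sum += table[x & (TABLE_SIZE - 1)];       // random single-byte read
    }
    return (void *)(uintptr_t)sum;  // stop the loop being optimized away
}

int main(void) {
    table = malloc(TABLE_SIZE);
    memset(table, 1, TABLE_SIZE);  // touch pages so they're backed by DRAM
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, reader, (void *)(uintptr_t)(i + 1));
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}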
The Intel datasheets almost answer this question.
My first clue was a question on the Intel forums: https://communities.intel.com/thread/110798
Jaehyuk.Lee, 01-Feb-2017 09:27 asked almost the same question as me:
The Second question is about simultaneous requests on IMC and its support on brand-new CPU models such as skylake and kaby-lake http://www.intel.com/Assets/PDF/datasheet/323341.pdf Following the above link, "The memory controller can operate on up to 32 simultaneous requests(reads and writes)" I would like to know how many simultaneous requests are supported in skylake and kabylake CPUs. I've already checked the 6th and 7th generation of the Intel CPU datasheet, but I cannot find any information.
The link is dead. But his "32" figure sounds plausible.
An Intel staff member responded, quoting from 6th Generation Intel® Processor Families for S-Platforms, Vol 1:
The memory controller has an advanced command scheduler where all pending requests are examined simultaneously to determine the most efficient request to be issued next. The most efficient request is picked from all pending requests and issued to system memory Just-in-Time to make optimal use of Command Overlapping. Thus, instead of having all memory access requests go individually through an arbitration mechanism forcing requests to be executed one at a time, they can be started without interfering with the current request allowing for concurrent issuing of requests. This allows for optimized bandwidth and reduced latency while maintaining appropriate command spacing to meet system memory protocol.
Annoyingly, the datasheet for my Xeon E5-2670 v3 does not contain an equivalent section.
Another part of the answer is that the E5-2670 v3 has 4 DDR channels. Memory is interleaved at a 256-byte granularity to optimize bandwidth. In other words, if you read a 1024-byte block starting at address 0, the first 256 bytes are fetched from DIMM 0, bytes 256 to 511 come from DIMM 1, and so on.
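If that's right, the channel a physical address maps to would look something like this (a sketch of plain round-robin interleaving; a real controller may hash address bits instead):

#include <stdint.h>

#define NUM_CHANNELS 4
#define INTERLEAVE   256  // bytes served by one channel before the next

// Hypothetical address-to-channel mapping for simple 256-byte
// round-robin interleaving across 4 channels.
static inline unsigned addr_to_channel(uint64_t phys_addr) {
    return (phys_addr / INTERLEAVE) % NUM_CHANNELS;
}
// e.g. bytes 0-255 -> channel 0, 256-511 -> channel 1, ...,
// and bytes 1024-1279 wrap back to channel 0.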
Putting the two together, I suspect that the memory controller can do 4 reads in parallel and is smart enough that if 4 or more threads are waiting on reads that map into 4 different DIMMs, it will do them in parallel. And it has enough hardware to keep up to about 32 reads/writes in its scheduling table.
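If that suspicion is right, the scheduler might behave a little like this toy model (pure speculation on my part, just to illustrate a 32-entry request table feeding 4 independent channels; addr_to_channel is the mapping sketched above):

#include <stdbool.h>
#include <stdint.h>

#define PENDING_MAX  32  // "up to 32 simultaneous requests"
#define NUM_CHANNELS 4

struct request { uint64_t addr; bool valid; };

static struct request pending[PENDING_MAX];
static long channel_busy_until[NUM_CHANNELS];  // cycle each channel frees up

// Each cycle, issue any pending request whose target channel is idle.
// Requests on different channels proceed in parallel; requests on the
// same channel queue up behind each other.
static void issue_ready(long now) {
    for (int i = 0; i < PENDING_MAX; i++) {
        if (!pending[i].valid)
            continue;
        unsigned ch = addr_to_channel(pending[i].addr);
        if (channel_busy_until[ch] <= now) {
            channel_busy_until[ch] = now + 30;  // roughly tRCD + tCL
            pending[i].valid = false;           // request issued
        }
    }
}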
I can think of another possible way to achieve parallelism. Each DDR channel has its own data and address buses. When the memory controller issues a read, it drives the address lines plus some control lines, then waits for the response. For a random read there are typically two waits, the RAS-to-CAS delay and the CAS latency, about 15 cycles each. Instead of leaving the address lines idle, you could imagine the memory controller starting another read from a different DIMM (*) during these wait periods. I have no idea if this is done.
* Actually, according to this Anandtech article, there is more parallelism in the DRAM hardware than just having multiple DIMMs per channel. Each DIMM might have multiple ranks, and each rank has many banks. I think you could switch to another rank or bank within a DIMM to perform another access in parallel.
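To make that concrete, a physical address decomposes into DRAM coordinates something like the following (field widths made up for illustration; real controllers choose, and often hash, these bits). Two accesses can overlap if they differ in channel, rank or bank; two accesses to different rows of the same bank must serialize behind a precharge and activate.

#include <stdint.h>

// Illustrative only: where the parallelism lives in a DRAM address.
struct dram_coords {
    unsigned channel;  // independent data + address bus
    unsigned rank;     // chip select within a channel
    unsigned bank;     // independent row buffer within a rank
    unsigned row;      // must be opened (RAS) before access
    unsigned column;   // selected within the open row (CAS)
};

static struct dram_coords decode(uint64_t addr) {
    struct dram_coords c;
    c.column  =  addr        & 0xFF;  // offset inside the 256-byte chunk
    c.channel = (addr >> 8)  & 0x3;   // 256-byte interleave over 4 channels
    c.bank    = (addr >> 10) & 0x7;   // e.g. 8 banks per rank
    c.rank    = (addr >> 13) & 0x1;   // e.g. 2 ranks per DIMM
    c.row     =  addr >> 14;          // remaining bits select the row
    return c;
}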
EDIT
I measured that my machine can do at least 6 random accesses in parallel, despite only having 4 memory channels. So a single memory channel can perform 2 or more random accesses in parallel, perhaps using the scheme I described in the paragraph above.
To get this information, I used tinymembench to measure the latency of DRAM access on my machine. The result was 60 ns. I then wrote a small C program to perform 32-bit reads from a 1 GB table of random numbers and add each value to a checksum. Pseudo code:
uint32_t checksum = 0;
for (int i = 0; i < 256 * 1024 * 1024; i++) {
    // rand32() returns a uniform 32-bit random number;
    // TABLE_SIZE is a power of two, so the mask keeps offset in range
    unsigned offset = rand32() & (TABLE_SIZE - 1);
    checksum += table_of_random_numbers[offset];
}
Each iteration of the loop took 10 ns on average. This is because the out-of-order and speculative execution features of my CPU were able to overlap about 6 iterations of this loop at a time, i.e. 10 ns = 60 ns / 6.
If instead I replaced the code with:
unsigned offset = rand32() & (TABLE_SIZE - 1);
for (int i = 0; i < 256 * 1024 * 1024; i++) {
    // each address depends on the value just loaded,
    // so the loads form a serial dependency chain
    offset = table_of_random_numbers[offset];
    offset &= (TABLE_SIZE - 1);
}
Then each iteration takes 60 ns, because the loop cannot be parallelized: the address of each access depends on the result of the previous read.
I also checked the assembly the compiler generated, to make sure it hadn't done the parallelizing itself.
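For anyone who wants to reproduce this, here is a self-contained version of both loops (a sketch, not my exact benchmark: it uses the fastrand LCG from EDIT 3 below and clock_gettime for timing):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TABLE_BYTES (1ULL << 30)                     // 1 GB table
#define TABLE_ELEMS (TABLE_BYTES / sizeof(uint32_t)) // power of two
#define ITERS       (256L * 1024 * 1024)

static uint32_t g_seed = 12345;
static uint32_t fastrand(void) {  // LCG from EDIT 3
    g_seed = 214013 * g_seed + 2531011;
    return g_seed;
}

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    uint32_t *table = malloc(TABLE_BYTES);
    if (!table)
        return 1;
    for (size_t i = 0; i < TABLE_ELEMS; i++)
        table[i] = fastrand();  // random contents; also maps the pages

    // Independent loads: each address comes from the RNG, not from
    // memory, so out-of-order execution can overlap several misses.
    uint64_t checksum = 0;
    double t0 = now_ns();
    for (long i = 0; i < ITERS; i++)
        checksum += table[fastrand() & (TABLE_ELEMS - 1)];
    printf("independent: %.1f ns/access (checksum %llu)\n",
           (now_ns() - t0) / ITERS, (unsigned long long)checksum);

    // Dependent chain: each address is the previous loaded value,
    // so every access pays the full DRAM latency.
    uint32_t offset = fastrand() & (TABLE_ELEMS - 1);
    t0 = now_ns();
    for (long i = 0; i < ITERS; i++)
        offset = table[offset] & (TABLE_ELEMS - 1);
    printf("dependent:   %.1f ns/access (offset %u)\n",
           (now_ns() - t0) / ITERS, offset);

    free(table);
    return 0;
}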
EDIT 2
I decided to test what happens when I run multiple instances of the test in parallel, each as a separate process. I used the snippet above that includes the checksum (i.e. the one that appears to see a latency of 10 ns per access). Running 6 instances in parallel, I got an average apparent latency of 13.9 ns, meaning that about 26 accesses must have been occurring in parallel: (60 ns / 13.9 ns) * 6 = 25.9.
Six instances turned out to be optimal; any more caused the overall throughput to drop.
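The multi-process runs were just N identical copies of the binary started together; a fork-based launcher along these lines reproduces the setup ("./bench" stands for the checksum program above):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 6;  // number of parallel copies
    for (int i = 0; i < n; i++) {
        if (fork() == 0) {
            execl("./bench", "./bench", (char *)NULL);
            _exit(1);  // only reached if exec fails
        }
    }
    while (wait(NULL) > 0)  // reap all the children
        ;
    return 0;
}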
EDIT 3 - Responding to Peter Cordes' RNG question
I tried two different random number generators.
uint32_t g_seed = 12345;

// Simple LCG (the multiplier and increment are the Microsoft C rand() constants)
uint32_t fastrand() {
    g_seed = 214013 * g_seed + 2531011;
    return g_seed;
}
and
// *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
// Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;

uint32_t pcg32_random_r(pcg32_random_t* rng)
{
    uint64_t oldstate = rng->state;
    // Advance internal state
    rng->state = oldstate * 6364136223846793005ULL + (rng->inc|1);
    // Calculate output function (XSH RR), uses old state for max ILP
    uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
    uint32_t rot = oldstate >> 59u;
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}
They both performed about the same; I can't remember the exact numbers. The peak single-threaded performance I saw was with the simpler RNG, which gave an amortised latency of 8.5 ns, implying 7 reads in parallel (60 / 8.5 ≈ 7). The assembly for the timed loop was:
// Pseudo random number is in edx
// table is in rdi
// loop counter is in rcx
// checksum is in rax
.L8:
    imull   $214013, %edx, %edx     # g_seed *= 214013
    addl    $2531011, %edx          # g_seed += 2531011
    movl    %edx, %esi              # copy seed to compute the offset
    movl    %edx, g_seed(%rip)      # store the new seed back to the global
    andl    $1073741823, %esi       # offset = seed & (TABLE_SIZE - 1)
    movzbl  (%rdi,%rsi), %esi       # load one byte from the table, zero-extended
    addq    %rsi, %rax              # checksum += loaded value
    subq    $1, %rcx                # decrement loop counter
    jne     .L8
    ret
I don't understand "g_seed(%rip)". Is that a memory access? Why would the compiler do that?
EDIT 4 - Removed global variable from random number generator
I removed the global variable from the random number generator as suggested by Peter. The generated code was indeed cleaner. I also switched to Intel syntax for the disassembly (thanks for the tip).
// Pseudo random number is in edx
// table is in rdi
// loop counter is in rcx
// checksum is in rax
.L8:
    imul    edx, edx, 214013        # g_seed *= 214013
    add     edx, 2531011            # g_seed += 2531011
    mov     esi, edx                # copy seed to compute the offset
    and     esi, 1073741823         # offset = seed & (TABLE_SIZE - 1)
    movzx   esi, BYTE PTR [rdi+rsi] # load one byte from the table, zero-extended
    add     rax, rsi                # checksum += loaded value
    sub     rcx, 1                  # decrement loop counter
    jne     .L8
    ret
The performance was unchanged though, in both the single-process and multi-process cases.