I have a recent 12-core Intel CPU (Haswell architecture) with 4 memory channels. How many DRAM accesses can the machine do in parallel?
For example, suppose I have a program with 12 threads, each sitting in a tight loop that reads a single byte from a random memory address in a range too large to fit in cache. I expect all 12 threads to spend almost all of their time waiting for the memory fetch.
Do the threads have to take it in turns to use the DRAM bus?
NOTE: Assume I'm using a 1 GB virtual-memory page size, so there are no TLB misses.
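To make the setup concrete, this is roughly the kind of program I mean (a minimal pthreads sketch; the xorshift RNG, iteration count and 4 GB table are placeholders, not a tuned benchmark):

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_THREADS 12
#define TABLE_SIZE  (4ULL * 1024 * 1024 * 1024)  // much larger than cache

static uint8_t *table;

static void *reader(void *arg) {
    uint64_t x = (uintptr_t)arg;  // per-thread RNG seed, never zero
    uint64_t sum = 0;
    for (long i = 0; i < 100000000L; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;  // xorshift64
        sum += table[x & (TABLE_SIZE - 1)];       // random single-byte read
    }
    return (void *)(uintptr_t)sum;  // stop the loop being optimized away
}

int main(void) {
    table = malloc(TABLE_SIZE);
    memset(table, 1, TABLE_SIZE);  // touch pages so they're backed by DRAM
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, reader, (void *)(uintptr_t)(i + 1));
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}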
The Intel datasheets almost answer this question.
My first clue was a question on the Intel forums: https://communities.intel.com/thread/110798
Jaehyuk.Lee, 01-Feb-2017 09:27 asked almost the same question as me:
The Second question is about simultaneous requests on IMC and its support on brand-new CPU models such as skylake and kaby-lake http://www.intel.com/Assets/PDF/datasheet/323341.pdf Following the above link, "The memory controller can operate on up to 32 simultaneous requests(reads and writes)" I would like to know how many simultaneous requests are supported in skylake and kabylake CPUs. I've already checked the 6th and 7th generation of the Intel CPU datasheet, but I cannot find any information.
The link is dead. But his "32" figure sounds plausible.
An Intel staff member responded, quoting from 6th Generation Intel® Processor Families for S-Platforms, Vol 1:
The memory controller has an advanced command scheduler where all pending requests are examined simultaneously to determine the most efficient request to be issued next. The most efficient request is picked from all pending requests and issued to system memory Just-in-Time to make optimal use of Command Overlapping. Thus, instead of having all memory access requests go individually through an arbitration mechanism forcing requests to be executed one at a time, they can be started without interfering with the current request allowing for concurrent issuing of requests. This allows for optimized bandwidth and reduced latency while maintaining appropriate command spacing to meet system memory protocol.
Annoyingly, the datasheet for my Xeon E5-2670 v3 does not contain an equivalent section.
Another part of the answer is that the E5-2670 v3 has 4 DDR channels. Memory is interleaved at a 256-byte granularity to optimize bandwidth. In other words, if you read a 1024-byte block starting at address 0, the first 256 bytes are fetched from DIMM 0, bytes 256 to 511 come from DIMM 1, and so on.
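If that's right, the channel a physical address maps to would look something like this (a sketch of plain round-robin interleaving; a real controller may hash address bits instead):

#include <stdint.h>

#define NUM_CHANNELS 4
#define INTERLEAVE   256  // bytes served by one channel before the next

// Hypothetical address-to-channel mapping for simple 256-byte
// round-robin interleaving across 4 channels.
static inline unsigned addr_to_channel(uint64_t phys_addr) {
    return (phys_addr / INTERLEAVE) % NUM_CHANNELS;
}
// e.g. bytes 0-255 -> channel 0, 256-511 -> channel 1, ...,
// and bytes 1024-1279 wrap back to channel 0.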
Putting the two together, I suspect that the memory controller can do 4 reads in parallel and is smart enough that if 4 or more threads are waiting on reads that map into 4 different DIMMs, it will do them in parallel. And it has enough hardware to keep up to about 32 reads/writes in its scheduling table.
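If that suspicion is right, the scheduler might behave a little like this toy model (pure speculation on my part, just to illustrate a 32-entry request table feeding 4 independent channels; addr_to_channel is the mapping sketched above):

#include <stdbool.h>
#include <stdint.h>

#define PENDING_MAX  32  // "up to 32 simultaneous requests"
#define NUM_CHANNELS 4

struct request { uint64_t addr; bool valid; };

static struct request pending[PENDING_MAX];
static long channel_busy_until[NUM_CHANNELS];  // cycle each channel frees up

// Each cycle, issue any pending request whose target channel is idle.
// Requests on different channels proceed in parallel; requests on the
// same channel queue up behind each other.
static void issue_ready(long now) {
    for (int i = 0; i < PENDING_MAX; i++) {
        if (!pending[i].valid)
            continue;
        unsigned ch = addr_to_channel(pending[i].addr);
        if (channel_busy_until[ch] <= now) {
            channel_busy_until[ch] = now + 30;  // roughly tRCD + tCL
            pending[i].valid = false;           // request issued
        }
    }
}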
I can think of another possible way to achieve parallelism. Each DDR channel has its own data and address buses. When the memory controller issues a read, it drives the address lines plus some control lines, then waits for the response. For a random read there are typically two waits, the RAS-to-CAS delay and the CAS latency, about 15 cycles each. Instead of leaving the address lines idle, you could imagine the memory controller starting another read from a different DIMM (*) during these wait periods. I have no idea if this is done.
* Actually, according to this Anandtech article, there is more parallelism in the DRAM hardware than just having multiple DIMMs per channel. Each DIMM might have multiple ranks, and each rank has many banks. I think you could switch to another rank or bank within a DIMM to perform another access in parallel.
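To make that concrete, a physical address decomposes into DRAM coordinates something like the following (field widths made up for illustration; real controllers choose, and often hash, these bits). Two accesses can overlap if they differ in channel, rank or bank; two accesses to different rows of the same bank must serialize behind a precharge and activate.

#include <stdint.h>

// Illustrative only: where the parallelism lives in a DRAM address.
struct dram_coords {
    unsigned channel;  // independent data + address bus
    unsigned rank;     // chip select within a channel
    unsigned bank;     // independent row buffer within a rank
    unsigned row;      // must be opened (RAS) before access
    unsigned column;   // selected within the open row (CAS)
};

static struct dram_coords decode(uint64_t addr) {
    struct dram_coords c;
    c.column  =  addr        & 0xFF;  // offset inside the 256-byte chunk
    c.channel = (addr >> 8)  & 0x3;   // 256-byte interleave over 4 channels
    c.bank    = (addr >> 10) & 0x7;   // e.g. 8 banks per rank
    c.rank    = (addr >> 13) & 0x1;   // e.g. 2 ranks per DIMM
    c.row     =  addr >> 14;          // remaining bits select the row
    return c;
}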
EDIT
I measured that my machine can do at least 6 random accesses in parallel, despite only having 4 memory channels. So a single memory channel can perform 2 or more random accesses in parallel, perhaps using the scheme I described in the paragraph above.
To get this information, I used tinymembench to measure the latency of DRAM access on my machine. The result was 60 ns. I then wrote a small C program to perform 32-bit reads from a 1 GB table of random numbers and add each value to a checksum. Pseudo code:
uint32_t checksum = 0;
for (int i = 0; i < 256 * 1024 * 1024; i++) {
    // rand32() returns a uniform 32-bit random number;
    // TABLE_SIZE is a power of two, so the mask keeps offset in range
    unsigned offset = rand32() & (TABLE_SIZE - 1);
    checksum += table_of_random_numbers[offset];
}
Each iteration of the loop took 10 ns on average. This is because the out-of-order and speculative execution features of my CPU were able to overlap about 6 iterations of this loop at a time, i.e. 10 ns = 60 ns / 6.
If instead I replaced the code with:
unsigned offset = rand32() & (TABLE_SIZE - 1);
for (int i = 0; i < 256 * 1024 * 1024; i++) {
    // each address depends on the value just loaded,
    // so the loads form a serial dependency chain
    offset = table_of_random_numbers[offset];
    offset &= (TABLE_SIZE - 1);
}
Then each iteration takes 60 ns, because the loop cannot be parallelized: the address of each access depends on the result of the previous read.
I also checked the assembly the compiler generated, to make sure it hadn't done the parallelizing itself.
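For anyone who wants to reproduce this, here is a self-contained version of both loops (a sketch, not my exact benchmark: it uses the fastrand LCG from EDIT 3 below and clock_gettime for timing):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TABLE_BYTES (1ULL << 30)                     // 1 GB table
#define TABLE_ELEMS (TABLE_BYTES / sizeof(uint32_t)) // power of two
#define ITERS       (256L * 1024 * 1024)

static uint32_t g_seed = 12345;
static uint32_t fastrand(void) {  // LCG from EDIT 3
    g_seed = 214013 * g_seed + 2531011;
    return g_seed;
}

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    uint32_t *table = malloc(TABLE_BYTES);
    if (!table)
        return 1;
    for (size_t i = 0; i < TABLE_ELEMS; i++)
        table[i] = fastrand();  // random contents; also maps the pages

    // Independent loads: each address comes from the RNG, not from
    // memory, so out-of-order execution can overlap several misses.
    uint64_t checksum = 0;
    double t0 = now_ns();
    for (long i = 0; i < ITERS; i++)
        checksum += table[fastrand() & (TABLE_ELEMS - 1)];
    printf("independent: %.1f ns/access (checksum %llu)\n",
           (now_ns() - t0) / ITERS, (unsigned long long)checksum);

    // Dependent chain: each address is the previous loaded value,
    // so every access pays the full DRAM latency.
    uint32_t offset = fastrand() & (TABLE_ELEMS - 1);
    t0 = now_ns();
    for (long i = 0; i < ITERS; i++)
        offset = table[offset] & (TABLE_ELEMS - 1);
    printf("dependent:   %.1f ns/access (offset %u)\n",
           (now_ns() - t0) / ITERS, offset);

    free(table);
    return 0;
}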
EDIT 2
I decided to test what happens when I run multiple instances of the test in parallel, each as a separate process. I used the snippet above that includes the checksum (i.e. the one that appears to see a latency of 10 ns per access). Running 6 instances in parallel, I got an average apparent latency of 13.9 ns, meaning that about 26 accesses must have been occurring in parallel: (60 ns / 13.9 ns) * 6 = 25.9.
Six instances turned out to be optimal; any more caused the overall throughput to drop.
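The multi-process runs were just N identical copies of the binary started together; a fork-based launcher along these lines reproduces the setup ("./bench" stands for the checksum program above):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 6;  // number of parallel copies
    for (int i = 0; i < n; i++) {
        if (fork() == 0) {
            execl("./bench", "./bench", (char *)NULL);
            _exit(1);  // only reached if exec fails
        }
    }
    while (wait(NULL) > 0)  // reap all the children
        ;
    return 0;
}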
EDIT 3 - Responding to Peter Cordes' RNG question
I tried two different random number generators.
uint32_t g_seed = 12345;

// Simple LCG (the multiplier and increment are the Microsoft C rand() constants)
uint32_t fastrand() {
    g_seed = 214013 * g_seed + 2531011;
    return g_seed;
}
and
// *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
// Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;

uint32_t pcg32_random_r(pcg32_random_t* rng)
{
    uint64_t oldstate = rng->state;
    // Advance internal state
    rng->state = oldstate * 6364136223846793005ULL + (rng->inc|1);
    // Calculate output function (XSH RR), uses old state for max ILP
    uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
    uint32_t rot = oldstate >> 59u;
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}
They both performed about the same; I can't remember the exact numbers. The peak single-threaded performance I saw was with the simpler RNG, which gave an amortised latency of 8.5 ns, implying 7 reads in parallel (60 / 8.5 ≈ 7). The assembly for the timed loop was:
// Pseudo random number is in edx
// table is in rdi
// loop counter is in rcx
// checksum is in rax
.L8:
    imull   $214013, %edx, %edx     # g_seed *= 214013
    addl    $2531011, %edx          # g_seed += 2531011
    movl    %edx, %esi              # copy seed to compute the offset
    movl    %edx, g_seed(%rip)      # store the new seed back to the global
    andl    $1073741823, %esi       # offset = seed & (TABLE_SIZE - 1)
    movzbl  (%rdi,%rsi), %esi       # load one byte from the table, zero-extended
    addq    %rsi, %rax              # checksum += loaded value
    subq    $1, %rcx                # decrement loop counter
    jne     .L8
    ret
I don't understand "g_seed(%rip)". Is that a memory access? Why would the compiler do that?
EDIT 4 - Removed global variable from random number generator
I removed the global variable from the random number generator as suggested by Peter. The generated code was indeed cleaner. I also switched to Intel syntax for the disassembly (thanks for the tip).
// Pseudo random number is in edx
// table is in rdi
// loop counter is in rcx
// checksum is in rax
.L8:
    imul    edx, edx, 214013        # g_seed *= 214013
    add     edx, 2531011            # g_seed += 2531011
    mov     esi, edx                # copy seed to compute the offset
    and     esi, 1073741823         # offset = seed & (TABLE_SIZE - 1)
    movzx   esi, BYTE PTR [rdi+rsi] # load one byte from the table, zero-extended
    add     rax, rsi                # checksum += loaded value
    sub     rcx, 1                  # decrement loop counter
    jne     .L8
    ret
The performance was unchanged though, in both the single-process and multi-process cases.