I'm studying CUDA while following this guide.
I haven't finished it yet, but I decided to play a bit with what I've seen so far.
I tried to rewrite the first example where 256 threads are used. I wanted to make it so each thread operates on a contiguous slice of the array.
The goal is to sum 2 arrays with 1,048,576 items.
For comparison, this is the original code, where each array item is accessed according to a stride:
__global__
void add(int n, float *x, float *y) {
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
This is my function:
__global__
void add2(int n, float* x, float* y) {
    int sliceSize = n / blockDim.x;
    int lower = threadIdx.x * sliceSize;
    int upper = lower + sliceSize;
    for (int i = lower; i < upper; i++) {
        y[i] = x[i] + y[i];
    }
}
It turns out the second snippet runs almost 7x slower than the first one (22 ms versus 3 ms). I thought that by having each thread access a contiguous slice, it would perform the same or faster.
I'm invoking the functions with add<<<1, threads>>>(n, x, y) and add2<<<1, threads>>>(n, x, y) (256 threads).
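For reference, my host-side setup looks roughly like the guide's unified-memory example (this is a sketch, not my exact code):

int n = 1 << 20;                        // 1,048,576 items
float *x, *y;
cudaMallocManaged(&x, n * sizeof(float));
cudaMallocManaged(&y, n * sizeof(float));
for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

int threads = 256;
add2<<<1, threads>>>(n, x, y);          // or add<<<1, threads>>>(n, x, y)
cudaDeviceSynchronize();

cudaFree(x);
cudaFree(y);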
The value of sliceSize is always 4096. In this case, what should happen is:

threadIdx.x = 0 goes from 0 to 4095
threadIdx.x = 1 goes from 4096 to 8191
threadIdx.x = 255 goes from 1044480 to 1048575

I turned on the NVIDIA Visual Profiler, and what I understood is that my memory access pattern is not efficient (Low Global Memory Load/Store Efficiency). This warning is not present with the first snippet. Why is this the case?
I thought the first snippet would jump all around the array, thus creating a bad access pattern. In reality, it seems to be fine.
I've read some documentation about memory optimization that comes with the visual profiler, but I don't quite understand why this is so slow.
You're exploring the difference between coalesced and uncoalesced memory access. Or we could simply say "most efficient" and "less efficient" memory access.
On a GPU, all instructions are executed warp-wide. So when one thread in a warp is reading a location in memory, all threads in the warp are reading from memory. The optimal pattern, roughly speaking, is when all threads in the warp are reading from adjacent locations. This results in a situation where the GPU memory controller, after inspecting the memory addresses requested by each thread in the warp for a particular read cycle, can coalesce addresses together to result in the need for the minimum number of lines to request from the cache (or the minimum number of segments to request from DRAM).
Such a situation is depicted pictorially on slide 36 (or 37) here.
The 100% coalesced case is represented in your first code snippet. An example of a read from global memory is here:
y[i] = x[i] + y[i];
       ^
       reading from the vector x in global memory
Let's consider the first pass of the loop, and the case of the first warp (i.e. the first 32 threads in the threadblock). In this case, i is given by threadIdx.x. Therefore thread 0 has an index of 0, thread 1 an index of 1, and so on, so each thread is reading an adjacent location in global memory. Assuming we miss all the caches, this will translate into a DRAM read request, and the memory controller can generate the minimum number of requests (more precisely: transactions) for segments from DRAM (or, equivalently, for lines in the cache). It is optimal in the sense that the "bus bandwidth utilization" is 100%: every byte requested is actually used by a thread in the warp on that read cycle.
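To make that concrete, here is a small host-side sketch (not part of either kernel; the 128-byte segment granularity is an assumption typical of current GPUs) that counts how many distinct 128-byte segments warp 0 touches when reading x on the first loop iteration of the strided kernel:

// Sketch: count distinct 128-byte segments touched by warp 0
// on its first read of x in the strided (first) kernel
#include <cstdio>
#include <set>

int main() {
    std::set<size_t> segments;
    for (int lane = 0; lane < 32; ++lane) {        // the 32 threads of warp 0
        size_t i = lane;                           // first iteration: i = threadIdx.x
        size_t byteOffset = i * sizeof(float);     // x[i], 4 bytes per element
        segments.insert(byteOffset / 128);         // 128-byte segment holding that address
    }
    std::printf("distinct segments: %zu\n", segments.size()); // prints 1
    return 0;
}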
"Uncoalesced" access may generically refer to any case which does not fit the above description. Translating into the finer-grained "bus bandwidth utilization" number described above, uncoalesced access may have varying degrees, from a best case of just below 100%, to a worst case of 12.5%, or 3.125%, depending on the exact case and GPU.
A worst-case uncoalesced access pattern example according to this description is given in slide 44 (or 45) here. This doesn't exactly describe your worst-case code snippet, but for a large enough sliceSize it is equivalent. The line of code is the same. Considering the same read request (for x, by warp 0, on the first iteration of the loop), the only difference is in the values that i takes on across the warp:
int sliceSize = n / blockDim.x;
int lower = threadIdx.x * sliceSize;
...
for (int i = lower; i < upper; i++) {
    y[i] = x[i] + y[i];
So i starts out at lower, which is just threadIdx.x * sliceSize. Let's suppose that sliceSize is greater than 1. Then the first thread will read location 0, the second thread will read location sliceSize, the third thread will read location 2*sliceSize, and so on. These locations are separated by a distance of sliceSize. Even if sliceSize is only 2, the pattern is still less efficient, as the memory controller must now request twice as many lines or segments to satisfy this particular read cycle across warp 0. If sliceSize is large enough, the memory controller must request a unique line or segment for each thread, which is the worst-case pattern.
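Running the same kind of sketch with the sliced indexing makes the cost visible. With sliceSize = 4096 (your case), every lane of warp 0 lands in its own segment:

// Sketch: same count, but with i = threadIdx.x * sliceSize
// (first iteration of add2), assuming sliceSize = 4096
#include <cstdio>
#include <set>

int main() {
    const size_t sliceSize = 4096;                 // n / blockDim.x = 1048576 / 256
    std::set<size_t> segments;
    for (int lane = 0; lane < 32; ++lane) {        // the 32 threads of warp 0
        size_t i = lane * sliceSize;               // lower = threadIdx.x * sliceSize
        segments.insert(i * sizeof(float) / 128);  // 128-byte segment holding x[i]
    }
    std::printf("distinct segments: %zu\n", segments.size()); // prints 32
    return 0;
}

One warp-wide read now needs 32 transactions instead of 1, which is consistent with the low global memory load/store efficiency the profiler reports.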
As a final note/takeaway, a useful observation for "quick analysis" can be made: for an optimal access pattern, we don't want to see threadIdx.x multiplied by any quantity other than 1. If threadIdx.x in any given indexing calculation is multiplied by some number not equal to 1, then regardless of other considerations, this is a nearly universal indication that the access pattern generated will be non-optimal. To repeat this for clarity:
index = any_constant_across_the_warp + threadIdx.x;
will generally be an optimal access pattern.
index = any_constant_across_the_warp + C*threadIdx.x;
will generally not be an optimal access pattern. Note that any_constant_across_the_warp can be composed of arbitrary arithmetic on quantities like the loop index, blockIdx.?, blockDim.?, gridDim.?, and any other constants. Some thought must be given to 2D or 3D threadblock patterns, where threadIdx.y will be taken into account, but it's generally not hard to extend this understanding to the 2D case. For typical threadblock shapes, for quick analysis, you generally would not want constant multipliers on either threadIdx.x or threadIdx.y.
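As a concrete illustration (the kernel below and its name are mine, not part of the question), a grid-stride version of the addition keeps any multiplier off threadIdx.x, so the per-iteration index is a warp-wide constant plus threadIdx.x and the accesses stay coalesced:

__global__
void add_coalesced(int n, float *x, float *y) {
    // index = constant_across_the_warp + threadIdx.x  -> optimal pattern
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;      // also constant across the warp
    for (int i = index; i < n; i += stride)   // each pass keeps warp lanes adjacent
        y[i] = x[i] + y[i];
}

Launched with a single block, as in add_coalesced<<<1, threads>>>(n, x, y), this reduces to your first snippet (blockIdx.x is 0 and gridDim.x is 1), which is exactly why that version profiles well.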
This entire discussion applies to global memory reads/writes. Shared memory also has rules for optimal access, which are in some ways similar to the above description and in some ways quite different. However, it is generally true that a fully optimal 100% coalesced pattern for global memory will also be an optimal pattern for shared memory reads/writes. Another way to say that is that adjacent access in a warp is generally optimal for shared memory as well (but it is not the only possible optimal pattern for shared memory).
The presentation already linked here will give a fuller treatment of this topic, as will many other presentations and treatments on the web.