I'm studying CUDA while following this guide.
I haven't finished it yet, but I decided to play a bit with what I've seen so far.
I tried to rewrite the first example where 256 threads are used. I wanted to make it so each thread operates on a contiguous slice of the array.
The goal is to sum 2 arrays with 1,048,576 items.
For comparison, this is the original code, where each array item is accessed according to a stride:
__global__
void add(int n, float *x, float *y) {
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
This is my function:
__global__
void add2(int n, float* x, float* y) {
    int sliceSize = n / blockDim.x;
    int lower = threadIdx.x * sliceSize;
    int upper = lower + sliceSize;
    for (int i = lower; i < upper; i++) {
        y[i] = x[i] + y[i];
    }
}
It turns out the second snippet runs almost 7x slower than the first one (22 ms versus 3 ms). I thought that by having each thread access a contiguous slice, it would perform the same or faster.
I'm invoking the functions with add<<<1, threads>>>(n, x, y) and add2<<<1, threads>>>(n, x, y) (256 threads).
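For reference, my host-side setup looks roughly like the guide's unified-memory example (this is a sketch, not my exact code):

int n = 1 << 20;                        // 1,048,576 items
float *x, *y;
cudaMallocManaged(&x, n * sizeof(float));
cudaMallocManaged(&y, n * sizeof(float));
for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

int threads = 256;
add2<<<1, threads>>>(n, x, y);          // or add<<<1, threads>>>(n, x, y)
cudaDeviceSynchronize();

cudaFree(x);
cudaFree(y);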
The value of sliceSize is always 4096. In this case, what should happen is:

threadIdx.x = 0 goes from 0 to 4095
threadIdx.x = 1 goes from 4096 to 8191
threadIdx.x = 255 goes from 1044480 to 1048575

I turned on the NVIDIA Visual Profiler, and what I understood is that my memory access pattern is not efficient (Low Global Memory Load/Store Efficiency). This warning is not present with the first snippet. Why is this the case?
I thought the first snippet would jump all around the array, thus creating a bad access pattern. In reality, it seems to be fine.
I've read some documentation about memory optimization that comes with the visual profiler, but I don't quite understand why this is so slow.
You're exploring the difference between coalesced and uncoalesced memory access. Or we could simply say "most efficient" and "less efficient" memory access.
On a GPU, all instructions are executed warp-wide. So when one thread in a warp is reading a location in memory, all threads in the warp are reading from memory. The optimal pattern, roughly speaking, is when all threads in the warp are reading from adjacent locations. This results in a situation where the GPU memory controller, after inspecting the memory addresses requested by each thread in the warp for a particular read cycle, can coalesce addresses together to result in the need for the minimum number of lines to request from the cache (or the minimum number of segments to request from DRAM).
Such a situation is depicted pictorially on slide 36 (or 37) here.
The 100% coalesced case is represented in your first code snippet. An example of a read from global memory is here:
y[i] = x[i] + y[i];
       ^
       reading from the vector x in global memory
Let's consider the first pass of the loop, and the case of the first warp (i.e. the first 32 threads in the threadblock). In this case, i is given by threadIdx.x. Therefore thread 0 has an index of 0, thread 1 an index of 1, and so on, so each thread is reading an adjacent location in global memory. Assuming we miss all the caches, this will translate into a DRAM read request, and the memory controller can generate the minimum number of requests (more precisely: transactions) for segments from DRAM (or, equivalently, for lines in the cache). It is optimal in the sense that the "bus bandwidth utilization" is 100%: every byte requested is actually used by a thread in the warp on that read cycle.
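To make that concrete, here is a small host-side sketch (not part of either kernel; the 128-byte segment granularity is an assumption typical of current GPUs) that counts how many distinct 128-byte segments warp 0 touches when reading x on the first loop iteration of the strided kernel:

// Sketch: count distinct 128-byte segments touched by warp 0
// on its first read of x in the strided (first) kernel
#include <cstdio>
#include <set>

int main() {
    std::set<size_t> segments;
    for (int lane = 0; lane < 32; ++lane) {        // the 32 threads of warp 0
        size_t i = lane;                           // first iteration: i = threadIdx.x
        size_t byteOffset = i * sizeof(float);     // x[i], 4 bytes per element
        segments.insert(byteOffset / 128);         // 128-byte segment holding that address
    }
    std::printf("distinct segments: %zu\n", segments.size()); // prints 1
    return 0;
}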
"Uncoalesced" access may generically refer to any case which does not fit the above description. Translating into the finer-grained "bus bandwidth utilization" number described above, uncoalesced access may have varying degrees, from a best case of just below 100%, to a worst case of 12.5%, or 3.125%, depending on the exact case and GPU.
A worst-case uncoalesced access pattern example according to this description is given in slide 44 (or 45) here. This doesn't exactly describe your worst-case code snippet, but for a large enough sliceSize it is equivalent. The line of code is the same. Considering the same read request (for x, by warp 0, on the first iteration of the loop), the only difference is in the values that i takes on across the warp:
int sliceSize = n / blockDim.x;
int lower = threadIdx.x * sliceSize;
...
for (int i = lower; i < upper; i++) {
    y[i] = x[i] + y[i];
So i starts out at lower, which is just threadIdx.x * sliceSize. Let's suppose that sliceSize is greater than 1. Then the first thread will read location 0, the second thread will read location sliceSize, the third thread will read location 2*sliceSize, and so on. These locations are separated by a distance of sliceSize. Even if sliceSize is only 2, the pattern is still less efficient, as the memory controller must now request twice as many lines or segments to satisfy this particular read cycle across warp 0. If sliceSize is large enough, the memory controller must request a unique line or segment for each thread, which is the worst-case pattern.
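Running the same kind of sketch with the sliced indexing makes the cost visible. With sliceSize = 4096 (your case), every lane of warp 0 lands in its own segment:

// Sketch: same count, but with i = threadIdx.x * sliceSize
// (first iteration of add2), assuming sliceSize = 4096
#include <cstdio>
#include <set>

int main() {
    const size_t sliceSize = 4096;                 // n / blockDim.x = 1048576 / 256
    std::set<size_t> segments;
    for (int lane = 0; lane < 32; ++lane) {        // the 32 threads of warp 0
        size_t i = lane * sliceSize;               // lower = threadIdx.x * sliceSize
        segments.insert(i * sizeof(float) / 128);  // 128-byte segment holding x[i]
    }
    std::printf("distinct segments: %zu\n", segments.size()); // prints 32
    return 0;
}

One warp-wide read now needs 32 transactions instead of 1, which is consistent with the low global memory load/store efficiency the profiler reports.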
As a final note/takeaway, a useful observation for "quick analysis" can be made: for an optimal access pattern, we don't want to see threadIdx.x multiplied by any quantity other than 1. If threadIdx.x in any given indexing calculation is multiplied by some number not equal to 1, then regardless of other considerations, this is a nearly universal indication that the access pattern generated will be non-optimal. To repeat this for clarity:
index = any_constant_across_the_warp + threadIdx.x;
will generally be an optimal access pattern.
index = any_constant_across_the_warp + C*threadIdx.x;
will generally not be an optimal access pattern. Note that any_constant_across_the_warp can be composed of arbitrary arithmetic on quantities like the loop index, blockIdx.?, blockDim.?, gridDim.?, and any other constants. Some thought must be given to 2D or 3D threadblock patterns, where threadIdx.y will be taken into account, but it's generally not hard to extend this understanding to the 2D case. For typical threadblock shapes, for quick analysis, you generally would not want constant multipliers on either threadIdx.x or threadIdx.y.
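As a concrete illustration (the kernel below and its name are mine, not part of the question), a grid-stride version of the addition keeps any multiplier off threadIdx.x, so the per-iteration index is a warp-wide constant plus threadIdx.x and the accesses stay coalesced:

__global__
void add_coalesced(int n, float *x, float *y) {
    // index = constant_across_the_warp + threadIdx.x  -> optimal pattern
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;      // also constant across the warp
    for (int i = index; i < n; i += stride)   // each pass keeps warp lanes adjacent
        y[i] = x[i] + y[i];
}

Launched with a single block, as in add_coalesced<<<1, threads>>>(n, x, y), this reduces to your first snippet (blockIdx.x is 0 and gridDim.x is 1), which is exactly why that version profiles well.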
This entire discussion applies to global memory reads/writes. Shared memory also has rules for optimal access, which are in some ways similar to the above description and in some ways quite different. However, it is generally true that a fully optimal 100% coalesced pattern for global memory will also be an optimal pattern for shared memory reads/writes. Another way to say that is that adjacent access in a warp is generally optimal for shared memory as well (but it is not the only possible optimal pattern for shared memory).
The presentation already linked here will give a fuller treatment of this topic, as will many other presentations and treatments on the web.