Rationalizing what is going on in my simple OpenCL kernel in regards to global memory

Question

const char programSource[] =
        "__kernel void vecAdd(__global int *a, __global int *b, __global int *c)"
        "{"
        "    int gid = get_global_id(0);"
        "for(int i=0; i<10; i++){"
        "    a[gid] = b[gid] + c[gid];}"
        "}";

The kernel above is a vector addition done ten times per loop. I have used the programming guide and stack overflow to figure out how global memory works, but I still can't figure out by looking at my code if I am accessing global memory in a good way. I am accessing it in a contiguous fashion and I am guessing in an aligned way. Does the card load 128kb chunks of global memory for arrays a, b, and c? Does it then load the 128kb chunks for each array once for every 32 gid indexes processed? (4*32=128) It seems like then I am not wasting any global memory bandwidth right?

BTW, the compute profiler shows a gld and gst efficiency of 1.00003, which seems weird, I thought it would just be 1.0 if all my stores and loads were coalesced. How is it above 1.0?

Grizzly · Accepted Answer

Yes your memory access pattern is pretty much optimal. Each halfwarp is accessing 16 consecutive 32bit words. Furthermore the access is 64byte aligned, since the buffers themselves are aligned and the startindex for each halfwarp is a multiple of 16. So each halfwarp will generate one 64Byte transaction. So you shouldn't waste memory bandwidth through uncoalesced accesses.

Since you asked for examples in your last question lets modify this code for other (less optimal access pattern (since the loop doesn't really do anything I will ignore that):

kernel void vecAdd(global int* a, global int* b, global int* c)
{
   int gid = get_global_id(0);
   a[gid+1] = b[gid * 2] + c[gid * 32];
}

At first lets se how this works on compute 1.3 (GT200) hardware

For the writes to a this will generate a slightly unoptimal pattern (following the halfwarps identified by their id range and the corresponding access pattern):

   gid  | addr. offset | accesses     | reasoning
  0- 15 |     4- 67    | 1x128B       | in aligned 128byte block
 16- 31 |    68-131    | 1x64B, 1x32B | crosses 128B boundary, so no 128B access
 32- 47 |   132-195    | 1x128B       | in aligned 128byte block
 48- 63 |   196-256    | 1x64B, 1x32B | crosses 128B boundary, so no 128B access

So basically we are wasting about half our bandwidth (the less then doubled access width for the odd halfwarps doesn't help much because it generates more accesses, which isn't faster then wasting more bytes so to speak).

For the reads from b the threads access only even elements of the array, so for each halfwarp all accesses lie in a 128byte aligned block (the first element is at the 128B boundary, since for that element the gid is a multiple of 16=> the index is a multiple of 32, for 4 byte elements, that means the address offset is a multiple of 128B). The accesspattern stretches over the whole 128B block, so this will do a 128B transfer for every halfwarp, again waisting half the bandwidth.

The reads from c generate one of the worst case scenarios, where each thread indices in its own 128B block, so each thread needs its own transfer, which one one hand is a bit of a serialization scenario (although not quite as bad as normaly, since the hardware should be able to overlap the transfers). Whats worse is the fact that this will transfer a 32B block for each thread, wasting 7/8 of the bandwidth (we access 4B/thread, 32B/4B=8, so only 1/8 of the bandwidth is utilized). Since this is the accesspattern of naive matrixtransposes, it is highly advisable to do those using local memory (speaking from experience).

Compute 1.0 (G80)

Here the only pattern which will create a good access is the original, all patterns in the example will create completely uncoalesced access, wasting 7/8 of the bandwidth (32B transfer/thread, see above). For G80 hardware every access where the nth thread in a halfwarp doesn't access the nth element creates such uncoalesced accesses

Compute 2.0 (Fermi)

Here every access to memory creates 128B transactions (as many as necessary to gather all data, so 16x128B in the worst case), however those are cached, making it less obvious where data will be transfered. For the moment lets assume the cache is big enough to hold all data and there are no conflicts, so every 128B cacheline will be transferred at most once. Lets furthermoe assume a serialized execution of the halfwarps, so we have a deterministic cache occupation.

Accesses to b will still always transfer 128B Blocks (no other thread indices in the coresponding memoryarea). Access to c will generate 128B transfers per thread (worst accesspattern possible).

For accesses to a it is the following (treating them like reads for the moment):

   gid  | offset  | accesses | reasoning
  0- 15 |   4- 67 |  1x128B  | bringing 128B block to cache
 16- 31 |  68-131 |  1x128B  | offsets 68-127 already in cache, bring 128B for 128-131 to cache
 32- 47 | 132-195 |    -     | block already in cache from  last halfwarp
 48- 63 | 196-259 |  1x128B  | offsets 196-255 already in cache, bringing in 256-383

So for large arrays the accesses to a will waste almost no bandwidth theoretically. For this example the reality is of course not quite as good, since the accesses to c will trash the cache pretty nicely

For the profiler I would assume that the efficiencies over 1.0 are simply results of floating point inaccurencies.

Hope that helps

Rationalizing what is going on in my simple OpenCL kernel in regards to global memory

Tags:

memory

cuda

nvidia

opencl

smuggledPancakes

1 Answers

Grizzly

Recent Activity

Donate For Us

Rationalizing what is going on in my simple OpenCL kernel in regards to global memory

Tags:

memory

cuda

nvidia

opencl

smuggledPancakes

1 Answers

Grizzly

Related questions

Recent Activity

Donate For Us