I have read CUDA programming guide, but i missed one thing. Let's say that i have array of 32bit int in global memory and i want to copy it to shared memory with coalesced access. Global array has indexes from 0 to 1024, and let's say i have 4 blocks each with 256 threads.
__shared__ int sData[256];
When is coalesced access performed?
1.
sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y];
Adresses in global memory are copied from 0 to 255, each by 32 threads in warp, so here it's ok?
2.
sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y + someIndex];
If someIndex is not multiple of 32 it is not coalesced? Misaligned adresses? Is that correct?
Coalesced memory access or memory coalescing refers to combining multiple memory accesses into a single transaction. On the K20 GPUs on Stampede, every successive 128 bytes ( 32 single precision words) memory can be accessed by a warp (32 consecutive threads) in a single transaction.
Global memory can be considered the main memory space of the GPU in CUDA. It is allocated, and managed, by the host, and it is accessible to both the host and the GPU, and for this reason the global memory space can be used to exchange data between the two.
: not united or grown together : not coalesced uncoalesced particles/layers.
Shared memory is a powerful feature for writing well optimized CUDA code. Access to shared memory is much faster than global memory access because it is located on chip. Because shared memory is shared by threads in a thread block, it provides a mechanism for threads to cooperate.
What you want ultimately depends on whether your input data is a 1D or 2D array, and whether your grid and blocks are 1D or 2D. The simplest case is both 1D:
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + threadIdx.x];
This is coalesced. The rule of thumb I use is that the most rapidly varying coordinate (the threadIdx) is added on as offset to the block offset (blockDim * blockIdx). The end result is that the indexing stride between threads in the block is 1. If the stride gets larger, then you lose coalescing.
The simple rule (on Fermi and later GPUs) is that if the addresses for all threads in a warp fall into the same aligned 128-byte range, then a single memory transaction will result (assuming caching is enabled for the load, which is the default). If they fall into two aligned 128-byte ranges, then two memory transactions result, etc.
On GT2xx and earlier GPUs, it gets more complicated. But you can find the details of that in the programming guide.
Additional examples:
Not coalesced:
shmem[threadIdx.x] = gmem[blockDim.x + blockIdx.x * threadIdx.x];
Not coalesced, but not too bad on GT200 and later:
stride = 2;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
Not coalesced at all:
stride = 32;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
Coalesced, 2D grid, 1D block:
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.x] = gmem[blockIdx.y * elementPitch +
blockIdx.x * blockDim.x + threadIdx.x];
Coalesced, 2D grid and block:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.y * blockDim.x + threadIdx.x] = gmem[y * elementPitch + x];
Your indexing at 1 is wrong (or intentionally so strange it seems wrong), some blocks access same element in each thread, so there is no way for coalesced access in these blocks.
Proof:
Example:
Grid = dim(2,2,0)
t(blockIdx.x, blockIdx.y)
//complete block reads at 0
t(0,0) -> sData[threadIdx.x] = gData[0];
//complete block reads at 2
t(0,1) -> sData[threadIdx.x] = gData[2];
//definetly coalesced
t(1,0) -> sData[threadIdx.x] = gData[threadIdx.x];
//not coalesced since 2 is no multiple of a half of the warp size = 16
t(1,1) -> sData[threadIdx.x] = gData[threadIdx.x + 2];
So its a "luck" game if a block is coalesced, so in general No
But coalesced memory reads rules are not as strict on newer cuda versions as before.
But for compatibility issues you should try to optimise kernels for lowest cuda versions, if it is possible.
Here is some nice source:
http://mc.stanford.edu/cgi-bin/images/0/0a/M02_4.pdf
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With