
CUDA coalesced access to global memory

I have read the CUDA programming guide, but I missed one thing. Let's say I have an array of 32-bit ints in global memory and I want to copy it to shared memory with coalesced access. The global array has 1024 elements (indexes 0 to 1023), and let's say I have 4 blocks, each with 256 threads.

__shared__ int sData[256];

When is coalesced access performed?

1.

sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y];

Addresses 0 to 255 in global memory are copied, 32 at a time by the threads of each warp, so is this OK?

2.

sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y + someIndex];

If someIndex is not a multiple of 32, is the access not coalesced because of misaligned addresses? Is that correct?

asked Apr 25 '12 by Hlavson


2 Answers

What you want ultimately depends on whether your input data is a 1D or 2D array, and whether your grid and blocks are 1D or 2D. The simplest case is both 1D:

shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + threadIdx.x];

This is coalesced. The rule of thumb I use is that the most rapidly varying coordinate (the threadIdx) is added as an offset to the block offset (blockDim * blockIdx). The end result is that the indexing stride between threads in the block is 1. If the stride gets larger, you lose coalescing.
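For concreteness, here is that pattern as a minimal kernel sketch (my illustration, not from the question; the names copyToShared, gmem and n are made up):

__global__ void copyToShared(const int *gmem, int n)
{
    __shared__ int shmem[256];                       // assumes blockDim.x == 256

    int i = blockDim.x * blockIdx.x + threadIdx.x;   // global element index
    if (i < n)
        shmem[threadIdx.x] = gmem[i];                // stride-1 across the warp -> coalesced

    __syncthreads();                                 // make the tile visible to the whole block
    // ... work with shmem ...
}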

The simple rule (on Fermi and later GPUs) is that if the addresses for all threads in a warp fall into the same aligned 128-byte range, then a single memory transaction will result (assuming caching is enabled for the load, which is the default). If they fall into two aligned 128-byte ranges, then two memory transactions result, etc.
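Applied to the question's someIndex case, a rough back-of-the-envelope check (my own illustration, assuming 4-byte ints and the 128-byte segments above):

// How many aligned 128-byte segments does a warp touch when its 32 threads
// read 32 consecutive 4-byte ints starting at element index base?
int segmentsTouched(int base)
{
    int firstByte = base * 4;                // byte address of the first element read
    int lastByte  = firstByte + 32 * 4 - 1;  // byte address of the last byte read
    return lastByte / 128 - firstByte / 128 + 1;
}
// segmentsTouched(0)  == 1 : base is a multiple of 32 -> one transaction per warp
// segmentsTouched(10) == 2 : misaligned base -> the warp straddles two segments

So an offset that is not a multiple of 32 elements does not destroy coalescing on this hardware; it just costs one extra transaction per warp.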

On GT2xx and earlier GPUs, it gets more complicated. But you can find the details of that in the programming guide.

Additional examples:

Not coalesced:

shmem[threadIdx.x] = gmem[blockDim.x + blockIdx.x * threadIdx.x];

Not coalesced, but not too bad on GT200 and later:

stride = 2;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];

Not coalesced at all:

stride = 32;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
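Counting 128-byte segments the same way as above shows why (again just an illustration, with thread k of the warp reading the 4-byte element at index stride * k):

// Number of aligned 128-byte segments spanned by one warp's reads
int segmentsSpanned(int stride)
{
    int firstByte = 0;
    int lastByte  = stride * 31 * 4 + 3;     // last byte read by thread 31
    return lastByte / 128 - firstByte / 128 + 1;
}
// segmentsSpanned(1)  == 1  : fully coalesced
// segmentsSpanned(2)  == 2  : two transactions, half the fetched bytes unused
// segmentsSpanned(32) == 32 : every thread in its own segment -- no coalescing at all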

Coalesced, 2D grid, 1D block:

int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.x] = gmem[blockIdx.y * elementPitch + 
                          blockIdx.x * blockDim.x + threadIdx.x]; 

Coalesced, 2D grid and block:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.y * blockDim.x + threadIdx.x] = gmem[y * elementPitch + x];
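Wrapped into a complete kernel, that last pattern might look like the following sketch (my illustration; loadTile2D, width, height and the tile sizes are made-up names, and the row pitch is simply passed in as width):

#define TILE_X 32
#define TILE_Y 8

// launch with blockDim = (TILE_X, TILE_Y)
__global__ void loadTile2D(const int *gmem, int width, int height)
{
    __shared__ int tile[TILE_Y][TILE_X];

    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column: varies fastest within a warp
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row

    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = gmem[y * width + x];   // row-major, stride-1 in x

    __syncthreads();
    // ... work on the tile ...
}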
answered by harrism


Your indexing in case 1 is wrong (or so intentionally strange that it seems wrong): some blocks access the same element from every thread, so there is no way for coalesced access in those blocks.

Proof:

Example:

Grid = dim3(2, 2, 1)  // a 2x2 grid, so gridDim.x == 2

t(blockIdx.x, blockIdx.y)

//complete block reads at 0
t(0,0) -> sData[threadIdx.x] = gData[0];
//complete block reads at 2
t(0,1) -> sData[threadIdx.x] = gData[2];
//definitely coalesced
t(1,0) -> sData[threadIdx.x] = gData[threadIdx.x];
//not coalesced, since the offset 2 is not a multiple of half the warp size (16)
t(1,1) -> sData[threadIdx.x] = gData[threadIdx.x + 2];

So it is a matter of "luck" whether a given block's access is coalesced; in general, no.

That said, the coalescing rules are not as strict on newer CUDA hardware as they used to be. For compatibility, though, you should try to optimize kernels for the lowest compute capability you target, if possible.
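
For the concrete setup in the question, a 1D grid of 4 blocks of 256 threads over 1024 ints, the coalesced form from the other answer would be:

// every thread loads exactly one element; consecutive threads hit consecutive ints
sData[threadIdx.x] = gData[blockIdx.x * blockDim.x + threadIdx.x];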

Here is a nice resource:

http://mc.stanford.edu/cgi-bin/images/0/0a/M02_4.pdf

answered by djmj