In CUDA, what is memory coalescing, and how is it achieved?

What does "coalesced" mean for a CUDA global memory transaction? I couldn't understand it even after going through my CUDA guide. How do I achieve it? In the CUDA programming guide's matrix example, is accessing the matrix row by row called "coalesced", or is it column by column? Which is correct, and why?

667

asked Feb 18 '11 12:02

kar

1 Answer

It's likely that this information applies only to compute capability 1.x, or CUDA 2.0. More recent architectures and CUDA 3.0 have more sophisticated global memory access, and in fact "coalesced global loads" are not even profiled for these chips.

Also, this logic can be applied to shared memory to avoid bank conflicts.

A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. This is an oversimplification, but the way to get it is simply to have consecutive threads access consecutive memory addresses.

So, if threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.
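To make the address pattern concrete, here is a minimal sketch of two copy kernels. The names `byte_offset`, `copy_coalesced`, and `copy_strided` are made up for illustration and assume 4-byte `float` elements; whether the strided version actually costs extra transactions depends on the compute capability, as noted above.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: byte offset read by a thread when every thread reads
// one 4-byte word and thread i reads word i (0x0, 0x4, 0x8, 0xc, ...).
__host__ __device__ inline unsigned int byte_offset(unsigned int thread_id) {
    return thread_id * 4u;  // assumes sizeof(float) == 4
}

// Consecutive threads touch consecutive elements: the coalescing-friendly case.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Neighbouring threads touch elements `stride` apart, so for stride > 1 the
// addresses within a half-warp are no longer consecutive.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```

A launch would look something like `copy_coalesced<<<blocks, threads>>>(d_in, d_out, n)`, where `d_in` and `d_out` are device buffers you allocated with `cudaMalloc`.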

In a matrix example, keep in mind that you want your matrix to reside linearly in memory. You can do this however you want, and your memory access should reflect how your matrix is laid out. So, the 3x4 matrix below

0 1 2 3
4 5 6 7
8 9 a b

could be stored row after row, like this, so that (r,c) maps to memory (r*4 + c)

0 1 2 3 4 5 6 7 8 9 a b
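As a rough sketch of that row-after-row mapping in code: `row_major_index` and `scale_matrix` are hypothetical names, and the kernel assumes one thread per element with `threadIdx.x` running along a row, so that neighbouring threads land on neighbouring addresses.

```cuda
#include <cuda_runtime.h>

// Row-major mapping for a matrix with `cols` columns: (r, c) -> r * cols + c.
// For the 3x4 example above, cols == 4.
__host__ __device__ inline int row_major_index(int r, int c, int cols) {
    return r * cols + c;
}

// One thread per element. The x dimension indexes the column, so threads
// that are adjacent in x read and write adjacent memory locations.
__global__ void scale_matrix(float* m, int rows, int cols, float factor) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols) {
        m[row_major_index(r, c, cols)] *= factor;
    }
}
```

With this layout, element (1,2) of the example matrix sits at linear index 6 and element (2,3) at index 11 (the "b" entry).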

Suppose you need to access each element once, and say you have four threads. Which threads will be used for which element? Probably either

thread 0:  0, 1, 2
thread 1:  3, 4, 5
thread 2:  6, 7, 8
thread 3:  9, a, b

or

thread 0:  0, 4, 8
thread 1:  1, 5, 9
thread 2:  2, 6, a
thread 3:  3, 7, b

Which is better? Which will result in coalesced reads, and which will not?

Either way, each thread makes three accesses. Let's look at the first access and see if the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. In the second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
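The two options can be sketched as kernels like the ones below. The names (`blocked_index`, `interleaved_index`, `sum_blocked`, `sum_interleaved`) and the per-thread sum are made up for illustration; the sizes are hard-coded to the four-thread, twelve-element example, and both kernels assume a single block of four threads.

```cuda
#include <cuda_runtime.h>

constexpr int kThreads = 4;    // threads in the example
constexpr int kPerThread = 3;  // accesses made by each thread

// First option: thread t owns a contiguous chunk (t*3, t*3+1, t*3+2).
__host__ __device__ inline int blocked_index(int t, int k) {
    return t * kPerThread + k;
}

// Second option: on access k, thread t reads element k*4 + t.
__host__ __device__ inline int interleaved_index(int t, int k) {
    return k * kThreads + t;
}

// On each access the four threads read addresses 3 elements apart.
__global__ void sum_blocked(const float* data, float* out) {
    int t = threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < kPerThread; ++k) acc += data[blocked_index(t, k)];
    out[t] = acc;
}

// On each access the four threads read 4 consecutive elements.
__global__ void sum_interleaved(const float* data, float* out) {
    int t = threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < kPerThread; ++k) acc += data[interleaved_index(t, k)];
    out[t] = acc;
}
```

Either kernel would be launched as, say, `sum_interleaved<<<1, kThreads>>>(d_data, d_out)` with `d_data` holding the 12 elements. Only the index function differs, which is the whole point: the work per thread is the same, but the second mapping keeps each access consecutive across threads.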

The best way is probably to write your kernel and then profile it to see if you have non-coalesced global loads and stores.

140

answered Sep 20 '22 17:09

jmilloy

Tags: definition, cuda, memory-access
