 

How to use coalesced memory access

I have N threads that execute simultaneously on the device, and together they need M*N floats from global memory. What is the correct way to access global memory so that the reads are coalesced? And how can shared memory help here?

Behzad Baghapour asked Jul 03 '11

1 Answer

Usually, good coalesced access is achieved when neighbouring threads access neighbouring cells in memory. So, if tid holds the index of your thread, then accessing:

  • arr[tid] --- gives perfect coalescence
  • arr[tid+5] --- is almost perfect, probably misaligned
  • arr[tid*4] --- is not that good anymore, because of the gaps
  • arr[random(0..N)] --- horrible!

I am talking from the perspective of a CUDA programmer, but similar rules apply elsewhere as well, even in simple CPU programming, although the impact is not as big there.
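
To make the difference concrete, here is a minimal sketch (the kernel and parameter names are made up for illustration) contrasting the first and third patterns from the list above:

__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid];        // arr[tid]: neighbouring threads read neighbouring floats
}

__global__ void stridedCopy(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid * 4 < n)
        out[tid] = in[tid * 4];    // arr[tid*4]: gaps between the floats a warp touches
}

In coalescedCopy a warp reads one contiguous chunk; in stridedCopy the same warp's reads are spread over four times as much memory.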


"But I have so many arrays everyone has about 2 or 3 times longer than the number of my threads and using the pattern like "arr[tid*4]" is inevitable. What may be a cure for this?"

If the offset is a multiple of some higher power of two (e.g. 16*x or 32*x), it is not a problem. So, if you have to process a rather long array in a for-loop, you can do something like this:

for (size_t base = 0; base < arraySize; base += numberOfThreads)
    process(arr[base + threadIndex]);

(the above assumes that the array size is a multiple of the number of threads)

So, if the number of threads is a multiple of 32, the memory access will be good.
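
Put together, such a kernel might look like the following sketch (the kernel name and the doubling step are made up; it assumes a single 1D block, so numberOfThreads is blockDim.x and threadIndex is threadIdx.x, and that arraySize is a multiple of blockDim.x):

__global__ void processArray(float *arr, size_t arraySize)
{
    // Each iteration the whole block touches one contiguous, aligned chunk,
    // so consecutive threads hit consecutive addresses.
    for (size_t base = 0; base < arraySize; base += blockDim.x)
        arr[base + threadIdx.x] *= 2.0f;   // stand-in for process()
}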

Note again: I am talking from the perspective of a CUDA programmer. For different GPUs/environments you might need fewer or more threads for perfect memory access coalescence, but similar rules should apply.


Is "32" related to the warp size which access parallel to the global memory?

Although not directly, there is a connection. Global memory is divided into segments of 32, 64 and 128 bytes which are accessed by half-warps. The more segments you touch for a given memory-fetch instruction, the longer it takes. You can read more details in the "CUDA Programming Guide"; there is a whole chapter on this topic: "5.3. Maximise Memory Throughput".
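
As a rough back-of-the-envelope illustration (assuming 4-byte floats and properly aligned data): a half-warp of 16 threads reading arr[tid] touches 16 * 4 = 64 contiguous bytes, which fits in a single 64-byte segment, i.e. one transaction; the same half-warp reading arr[tid*4] spreads its reads over 16 * 16 = 256 bytes, so several segments must be fetched for the same instruction.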

In addition, I have heard a little about using shared memory to localize memory accesses. Is this preferred for memory coalescing, or does it have its own difficulties?

Shared memory is much faster, as it lies on-chip, but its size is limited. It is not segmented like global memory; you can access it almost randomly at no penalty. However, it is organised into memory banks of width 4 bytes (the size of a 32-bit int), and the addresses the threads access should be different modulo 16 (or 32, depending on the GPU). So the pattern [tid*4] will be much slower than [tid*5], because the first one hits only banks 0, 4, 8, 12, while the latter hits 0, 5, 10, 15, 4, 9, 14, ... (bank id = word index modulo 16).

Again, you can read more in the CUDA Programming Guide.
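
As a rough sketch of the bank behaviour described above (the kernel name is made up; it assumes a single block of 32 threads and a 16-bank GPU):

__global__ void bankStrideDemo(float *out)
{
    __shared__ float smem[32 * 5];

    // Fill shared memory first, conflict-free.
    for (int i = threadIdx.x; i < 32 * 5; i += blockDim.x)
        smem[i] = (float)i;
    __syncthreads();

    int tid = threadIdx.x;

    // Stride 4: banks 0, 4, 8, 12, 0, ... -> threads in a half-warp collide.
    float slow = smem[tid * 4];

    // Stride 5: banks 0, 5, 10, 15, 4, 9, ... -> each thread hits a different bank.
    float fast = smem[tid * 5];

    out[tid] = slow + fast;   // keep the reads from being optimised away
}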

CygnusX1 answered Oct 21 '22