I have 'N' threads to perform simultaneously on device which they need M*N float from the global memory. What is the correct way to access the global memory coalesced? In this matter, how the shared memory can help?
Usually, a good coalesced access can be achieved when the neighbouring threads access neighbouring cells in memory. So, if tid holds the index of your thread, then accessing:
arr[tid] --- gives perfect coalescencearr[tid+5] --- is almost perfect, probably misalignedarr[tid*4] --- is not that good anymore, because of the gapsarr[random(0..N)] --- horrible!I am talking from the perspective of a CUDA programmer, but similar rules apply elsewhere as well, even in a simple CPU programming, although the impact is not that big there.
"But I have so many arrays everyone has about 2 or 3 times longer than the number of my threads and using the pattern like "arr[tid*4]" is inevitable. What may be a cure for this?"
If the offset is a multiple of some higher 2-power (e.g. 16*x or 32*x) it is not a problem. So, if you have to process a rather long array in a for-loop, you can do something like this:
for (size_t base=0; i<arraySize; i+=numberOfThreads)
    process(arr[base+threadIndex])
(the above asumes that array size is a multiple of the number of threads)
So, if the number of threads is a multiple of 32, the memory access will be good.
Note again: I am talking from the perspective of a CUDA programmer. For different GPUs/environment you might need less or more threads for perfect memory access coalescence, but similar rules should apply.
Is "32" related to the warp size which access parallel to the global memory?
Although not directly, there is some connection. Global memory is divided into segments of 32, 64 and 128 bytes which are accessed by half-warps. The more segments you access for a given memory-fetch instruction, the longer it goes. You can read more into details in the "CUDA Programming Guide", there is a whole chapter on this topic: "5.3. Maximise Memory Throughput".
In addition, I heard a little about shared memory to localize the memory access. Is this preferred for memory coalescing or have its own difficulties?
Shared memory is much faster as it lies on-chip, but its size is limited. The memory is not segmented like global, you can access is almost-randomly at no penality cost. However, there are memory bank lines of width 4 bytes (size of 32-bit int). The address of memory that each thread access should be different modulo 16 (or 32, depending on the GPU). So, address [tid*4] will be much slower than [tid*5], because the first one access only banks 0, 4, 8, 12 and the latter 0, 5, 10, 15, 4, 9, 14, ... (bank id = address modulo 16).
Again, you can read more in the CUDA Programming Guide.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With