So let's say I have a global array of memory:
|a|b|c| |e|f|g| |i|j|k| |
There are four 'threads' (local work items in OpenCL) accessing this memory, and two possible patterns for this access (columns are time slices, rows are threads):
     0 -> 1 -> 2 -> 3
t1   a -> b -> c -> .
t2   e -> f -> g -> .
t3   i -> j -> k -> .
t4   . -> . -> . -> .
The above pattern splits the array into blocks, with each thread iterating through its own block and accessing the next element per time slice. I believe this sort of access works well for CPUs because it maximizes cache locality per thread. Loops using this pattern can also be easily unrolled by the compiler.
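For concreteness, here is a minimal sketch of the blocked pattern as an OpenCL kernel. The kernel name and arguments are mine, and it assumes 'block' divides the array evenly:

// Sketch of the blocked pattern: each work-item walks its own
// contiguous chunk of 'block' elements (hypothetical kernel).
__kernel void sum_blocked(__global const float *in,
                          __global float *out,
                          const int block)          // elements per work-item
{
    int gid = get_global_id(0);
    float acc = 0.0f;

    // Work-item gid touches in[gid*block .. gid*block + block - 1]:
    // contiguous per thread, like t1 reading a, b, c above.
    for (int i = 0; i < block; ++i)
        acc += in[gid * block + i];

    out[gid] = acc;
}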
The second pattern:
     0 -> 1 -> 2 -> 3
t1   a -> e -> i -> .
t2   b -> f -> j -> .
t3   c -> g -> k -> .
t4   . -> . -> . -> .
The above pattern accesses memory in strides: for example, thread 1 accesses a, then e, then i, etc. This maximizes cache locality per unit time. Say you have 64 work-items 'striding' at a given time slice: with a cache-line size of 64 bytes and elements of sizeof(float), the reads of work-items 1-16 all fall within the cache line brought in by work-item 1's read. The data width/count per cell (where 'a' above is one cell) has to be chosen carefully to avoid misaligned access. These loops don't seem to unroll as easily (or at all, using Intel's Kernel Builder targeting the CPU). I believe this pattern would work well on a GPU.
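A minimal sketch of the strided pattern, under the same assumptions (names are mine, and 'n' is assumed to be a multiple of the global size):

// Sketch of the strided pattern: at each loop iteration, neighbouring
// work-items read neighbouring elements, so one time slice touches a
// single contiguous run of memory (hypothetical kernel).
__kernel void sum_strided(__global const float *in,
                          __global float *out,
                          const int n)              // total number of elements
{
    int gid    = get_global_id(0);
    int stride = get_global_size(0);
    float acc  = 0.0f;

    for (int i = gid; i < n; i += stride)
        acc += in[i];                 // t1 reads a, then e, then i, as above

    out[gid] = acc;
}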
I'm targeting GPUs with cache hierarchies, specifically AMD's latest architecture (GCN). Is the second access pattern an example of 'coalescing'? Am I going wrong somewhere in my thought process?
I think the answer depends on whether the accesses are to global or local memory. If you are pulling the data from global memory, then you need to worry about coalescing the reads (i.e., contiguous accesses across work-items, as in your second example). However, if you are pulling the data from local memory, then you need to worry about bank conflicts. I have some, but not a lot of, experience, so I'm not stating this as absolute truth.
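To make that concrete, here's a rough, untested sketch of the usual combination: coalesced reads from global memory staged into local memory, with a +1 padding column, which is a common trick to keep column-wise local reads from hitting the same bank. The kernel name, TILE size, and the transpose use case are just placeholders of mine:

#define TILE 16

// Assumes a TILE x TILE work-group and matrix dimensions that are
// multiples of TILE (hypothetical example).
__kernel void transpose_tiled(__global const float *in,
                              __global float *out)
{
    // +1 column of padding so that reading down a column of the tile
    // does not land every work-item in the same local-memory bank.
    __local float tile[TILE][TILE + 1];

    int lx = get_local_id(0);
    int ly = get_local_id(1);
    int gx = get_global_id(0);
    int gy = get_global_id(1);
    int w  = get_global_size(0);   // input width
    int h  = get_global_size(1);   // input height

    // Adjacent work-items in x read adjacent global addresses -> coalesced.
    tile[ly][lx] = in[gy * w + gx];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Write the tile back transposed; the global writes stay contiguous
    // in x, and the strided part happens in (padded) local memory.
    int ox = get_group_id(1) * TILE + lx;
    int oy = get_group_id(0) * TILE + ly;
    out[oy * h + ox] = tile[lx][ly];
}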
Edit: After reading up on GCN, I don't think the caches make a difference here. You can basically think of them as just speeding up global memory if you repeatedly read/write the same elements. On a side note, thanks for asking the question, because reading up on the new architecture is pretty interesting.
Edit 2: Here's a nice Stack Overflow discussion of banks for local and global memory: Why aren't there bank conflicts in global memory for Cuda/OpenCL?