My question concerns coalescing global writes to a dynamically changing set of elements of an array in CUDA. Consider the following kernel:
__global__ void
kernel (int n, int *odata, int *idata, int *hash)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        odata[hash[i]] = idata[i];
}
Here the first n elements of the array hash contain the indices of odata to be updated from the first n elements of idata. Obviously this leads to a terrible, terrible lack of coalescence. In the case of my code, the hash at one kernel invocation is completely unrelated to the hash at another (and other kernels update the data in other ways), so simply reordering the data to optimize this particular kernel isn't an option.
Is there some feature in CUDA which would allow me to improve the performance of this situation? I hear a lot of talk about texture memory, but I've not been able to translate what I've read into a solution for this problem.
Texturing is a read-only mechanism, so it cannot directly improve the performance of scattered writes to global memory. If you were "hashing" like this instead:
odata[i] = idata[hash[i]];
(perhaps your algorithm can be transformed?) then there might be some benefit to considering a texture mechanism, since the scattered accesses would then be reads and the writes to odata would be coalesced. (Your example appears to be 1D in nature.)
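As a rough sketch of that gather form, assuming a device of compute capability 3.5 or later: the __ldg intrinsic routes loads through the read-only data cache (the same hardware path textures use) without setting up a texture object. The names gather_kernel and run_gather are made up for illustration, the host helper skips error checking, and whether this actually helps depends on how much locality the indices in hash have.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Gather form: the write to odata[i] is coalesced, the scattered
// access is now a read, which goes through the read-only cache.
__global__ void gather_kernel(int n, int *__restrict__ odata,
                              const int *__restrict__ idata,
                              const int *__restrict__ hash)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        odata[i] = __ldg(&idata[__ldg(&hash[i])]);
}

// Hypothetical host helper (not from the question): copies the inputs
// to the device, runs the gather, and returns the result.
// Error checking is omitted to keep the sketch short.
std::vector<int> run_gather(const std::vector<int> &idata,
                            const std::vector<int> &hash)
{
    int n = static_cast<int>(hash.size());
    std::vector<int> out(n, 0);
    if (n == 0 || idata.empty())
        return out;

    int *d_in = nullptr, *d_hash = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, idata.size() * sizeof(int));
    cudaMalloc(&d_hash, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, idata.data(), idata.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_hash, hash.data(), n * sizeof(int),
               cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    gather_kernel<<<blocks, threads>>>(n, d_out, d_in, d_hash);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_hash);
    cudaFree(d_out);
    return out;
}
```

Note that turning the original scatter into this gather needs the inverse of the index mapping, which only exists if hash is a permutation. If your algorithm cannot supply that, this form does not apply.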
You might also make sure that your shared memory/L1 split is set to favor L1 cache over shared memory. This won't help much with scattered writes though.
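For what it's worth, the split can be requested per kernel with cudaFuncSetCacheConfig. A minimal sketch, using the kernel from the question and a made-up helper name prefer_l1_for_kernel; the setting is only a preference, and on devices where the L1/shared sizes are fixed it has no effect.

```cuda
#include <cuda_runtime.h>

// The scatter kernel from the question, unchanged.
__global__ void kernel(int n, int *odata, int *idata, int *hash)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        odata[hash[i]] = idata[i];
}

// Hypothetical helper: ask the runtime to prefer a larger L1 cache
// (and smaller shared memory) when this kernel is launched.
// Call it once before the first launch.
cudaError_t prefer_l1_for_kernel()
{
    return cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1);
}
```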