I am implementing a CUDA application on a compute capability 1.3 GPU that involves scanning a two-dimensional array for the locations where a smaller two-dimensional array occurs. Up until now, both arrays were allocated using cudaMallocPitch() and transferred using cudaMemcpy2D() to meet the memory alignment requirements for coalescing.
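For reference, the allocation and transfer looked roughly like this (a minimal sketch; the sizes and the name h_big are placeholders, the real dimensions come from the application):

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t width  = 1024;  /* elements per row (placeholder) */
        size_t height = 1024;  /* number of rows (placeholder)   */
        size_t pitch;          /* bytes per padded row, chosen by the driver */

        float *h_big = (float*)malloc(width * height * sizeof(float));
        float *d_big;

        /* cudaMallocPitch() pads each row so that every row start meets
           the device's alignment requirement. */
        cudaMallocPitch((void**)&d_big, &pitch, width * sizeof(float), height);

        /* cudaMemcpy2D() copies row by row, honoring the different pitches
           of the packed host array and the padded device array. */
        cudaMemcpy2D(d_big, pitch,
                     h_big, width * sizeof(float),
                     width * sizeof(float), height,
                     cudaMemcpyHostToDevice);

        cudaFree(d_big);
        free(h_big);
        return 0;
    }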
As a first optimization step, I am trying to coalesce the accesses to global memory by reading data collectively into shared memory. As a test on the still-unoptimized code (which, for example, has divergent branching and uncoalesced global memory accesses), I allocated the bigger array using cudaMalloc() instead and found that performance improved by up to 50%. How is this possible?
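The test variant simply replaced the pitched allocation with a plain linear one, something like this (reusing the placeholder names from the sketch above):

    /* Plain linear allocation: rows are packed back to back,
       with no padding between them. */
    float *d_big_linear;
    cudaMalloc((void**)&d_big_linear, width * height * sizeof(float));
    cudaMemcpy(d_big_linear, h_big, width * height * sizeof(float),
               cudaMemcpyHostToDevice);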
cudaMallocPitch() ensures that the starting address of each row in the 2-D array (row-major) is a multiple of 2^N, where N is 7 to 10 depending on the compute capability.
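In practice that padding means a pitched array has to be indexed through the byte pitch rather than the element width; a minimal sketch (the kernel and its names are illustrative):

    __global__ void scalePitched(float *base, size_t pitch,
                                 int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            /* Row y starts at base + y * pitch bytes, so every row start
               stays aligned even though pitch >= width * sizeof(float). */
            float *row = (float*)((char*)base + y * pitch);
            row[x] *= 2.0f;
        }
    }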
Whether the access is more efficient depends not only on data alignment but also on your compute capability, your global memory access pattern, and sometimes the cache configuration.
This blog post explains the large bandwidth penalty of misaligned data accesses on early compute capabilities, which may be the answer to your question:
https://developer.nvidia.com/content/how-access-global-memory-efficiently-cuda-cc-kernels
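The effect described there can be reproduced with an offset-copy kernel along these lines (a sketch, not the blog's exact benchmark): on compute capability 1.x, any nonzero offset misaligns each warp's reads and the measured bandwidth drops sharply, while later hardware largely hides it through caching.

    __global__ void offsetCopy(float *out, const float *in,
                               int n, int offset)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n - offset) {
            /* With offset != 0, a warp's reads no longer start on a
               64/128-byte segment boundary, so CC 1.x hardware splits
               them into several smaller memory transactions. */
            out[i] = in[i + offset];
        }
    }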
Since the performance depends on many factors, you may have to post your device model and the kernel code as well to allow further investigation.