 

Is pitched memory allocation more efficient with two-dimensional arrays?

I am implementing an application using CUDA with a compute capability 1.3 GPU that involves scanning a two-dimensional array for the locations where a smaller two-dimensional array occurs. Up until now, both arrays were allocated using cudaMallocPitch() and transferred using cudaMemcpy2D() to meet the memory alignment requirements for coalescing.

During the first optimization steps, I am trying to coalesce the accesses to global memory by having threads cooperatively read the data into shared memory. As a test in the unoptimized code (where, for example, there is divergent branching and the global memory accesses are not coalesced), I allocated the bigger array using cudaMalloc() instead and found that performance improved by up to 50%. How is this possible?

charis asked Feb 05 '13 19:02



1 Answer

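cudaMallocPitch() ensures that the starting address of each row in the 2-D array (row-major) is a multiple of 2^N bytes (N is 7 to 10, depending on the compute capability). It does this by padding each row, and it returns the padded row length in bytes as the pitch.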
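As a rough illustration of how the pitch is used, here is a minimal sketch, assuming a float array and a trivial kernel. The names `scale_rows` and `pitched_demo` are hypothetical and are not taken from the question's code. The key point is that rows are addressed by stepping in bytes by the pitch, not by the logical row width.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one element of a pitched 2-D array.
__global__ void scale_rows(float* data, size_t pitch, int width, int height,
                           float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        // The pitch is in bytes, so step to the row start through a char*.
        float* row_ptr = (float*)((char*)data + row * pitch);
        row_ptr[col] *= factor;
    }
}

// Hypothetical helper: returns 0 if the round trip worked, non-zero otherwise.
int pitched_demo(int width, int height)
{
    std::vector<float> host((size_t)width * height, 1.0f);
    size_t row_bytes = width * sizeof(float);  // unpadded row length on the host

    float* dev = NULL;
    size_t pitch = 0;                          // padded row length chosen by the runtime
    if (cudaMallocPitch((void**)&dev, &pitch, row_bytes, height) != cudaSuccess)
        return 1;
    printf("row bytes = %lu, pitch = %lu\n",
           (unsigned long)row_bytes, (unsigned long)pitch);

    // Copy row by row: the host rows are tightly packed, the device rows are padded.
    if (cudaMemcpy2D(dev, pitch, &host[0], row_bytes, row_bytes, height,
                     cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return 2;
    }

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale_rows<<<grid, block>>>(dev, pitch, width, height, 2.0f);
    if (cudaGetLastError() != cudaSuccess) {
        cudaFree(dev);
        return 3;
    }

    // This copy also waits for the kernel to finish.
    if (cudaMemcpy2D(&host[0], row_bytes, dev, pitch, row_bytes, height,
                     cudaMemcpyDeviceToHost) != cudaSuccess) {
        cudaFree(dev);
        return 4;
    }
    cudaFree(dev);

    // Every element started at 1.0f and should now be 2.0f.
    for (size_t i = 0; i < host.size(); ++i)
        if (host[i] != 2.0f) return 5;
    return 0;
}
```

Calling `pitched_demo(1000, 64)` from `main` prints the pitch your device actually picks. When the row width in bytes is not already a multiple of the alignment, the pitch is larger than the row, so the pitched array occupies more memory than the same array allocated with cudaMalloc().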

Whether the access is more efficient depends not only on the data alignment but also on your compute capability, your global memory access pattern, and sometimes the cache configuration.

This blog post explains the large drop in bandwidth caused by misaligned data access on devices of early compute capability, which could answer your question:

https://developer.nvidia.com/content/how-access-global-memory-efficiently-cuda-cc-kernels
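To check the effect of misalignment on your own device, something along these lines could be used. It is only a sketch: the kernel `offset_increment`, the helper `time_offset_access`, and the array size are assumptions made for illustration, and the timings will differ from one GPU to another.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread touches one element, shifted by `offset`
// elements, so a non-zero offset moves each warp off its aligned segment.
__global__ void offset_increment(float* a, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    a[i] = a[i] + 1.0f;
}

// Hypothetical helper: returns the kernel time in milliseconds for a given
// offset (assumed >= 0), or -1.0f if a CUDA call fails.
float time_offset_access(int offset)
{
    const int n = 1 << 22;      // elements processed (assumed size)
    const int block = 256;      // threads per block; n is a multiple of it
    size_t bytes = (size_t)(n + offset) * sizeof(float);  // room for the shift

    float* d_a = NULL;
    if (cudaMalloc((void**)&d_a, bytes) != cudaSuccess)
        return -1.0f;
    cudaMemset(d_a, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    offset_increment<<<n / block, block>>>(d_a, offset);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event are done

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    return ms;
}
```

Calling `time_offset_access()` for offsets 0 through 32 and comparing the times shows how sensitive a given device is to the starting alignment of each warp's accesses. The first launch includes one-time startup cost, so it is worth discarding a warm-up run before comparing.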

Since the performance depends on many factors, you may need to post your device model and the kernel code as well to allow further investigation.

kangshiyin answered Oct 05 '22 22:10
