I am implementing a CUDA application on a compute capability 1.3 GPU that involves scanning a two-dimensional array for the locations where a smaller two-dimensional array occurs. Up until now, both arrays were allocated using cudaMallocPitch() and transferred using cudaMemcpy2D() to meet the memory alignment requirements for coalescing.
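For reference, the allocation and transfer looked roughly like this (a minimal sketch; the sizes and the name h_big are placeholders, the real dimensions come from the application):

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t width  = 1024;  /* elements per row (placeholder) */
        size_t height = 1024;  /* number of rows (placeholder)   */
        size_t pitch;          /* bytes per padded row, chosen by the driver */

        float *h_big = (float*)malloc(width * height * sizeof(float));
        float *d_big;

        /* cudaMallocPitch() pads each row so that every row start meets
           the device's alignment requirement. */
        cudaMallocPitch((void**)&d_big, &pitch, width * sizeof(float), height);

        /* cudaMemcpy2D() copies row by row, honoring the different pitches
           of the packed host array and the padded device array. */
        cudaMemcpy2D(d_big, pitch,
                     h_big, width * sizeof(float),
                     width * sizeof(float), height,
                     cudaMemcpyHostToDevice);

        cudaFree(d_big);
        free(h_big);
        return 0;
    }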
As a first optimization step, I am trying to coalesce the accesses to global memory by reading data collectively into shared memory. As a test on the still-unoptimized code (which, for example, has divergent branching and uncoalesced global memory accesses), I allocated the bigger array using cudaMalloc() instead and found that performance improved by up to 50%. How is this possible?
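The test variant simply replaced the pitched allocation with a plain linear one, something like this (reusing the placeholder names from the sketch above):

    /* Plain linear allocation: rows are packed back to back,
       with no padding between them. */
    float *d_big_linear;
    cudaMalloc((void**)&d_big_linear, width * height * sizeof(float));
    cudaMemcpy(d_big_linear, h_big, width * height * sizeof(float),
               cudaMemcpyHostToDevice);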
cudaMallocPitch() ensures that the starting address of each row in the 2-D array (row-major) is a multiple of 2^N, where N is 7 to 10 depending on the compute capability.
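In practice that padding means a pitched array has to be indexed through the byte pitch rather than the element width; a minimal sketch (the kernel and its names are illustrative):

    __global__ void scalePitched(float *base, size_t pitch,
                                 int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            /* Row y starts at base + y * pitch bytes, so every row start
               stays aligned even though pitch >= width * sizeof(float). */
            float *row = (float*)((char*)base + y * pitch);
            row[x] *= 2.0f;
        }
    }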
Whether the access is more efficient depends not only on data alignment but also on your compute capability, your global memory access pattern, and sometimes the cache configuration.
This blog post explains the large bandwidth penalty of misaligned data accesses on early compute capabilities, which may be the answer to your question:
https://developer.nvidia.com/content/how-access-global-memory-efficiently-cuda-cc-kernels
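The effect described there can be reproduced with an offset-copy kernel along these lines (a sketch, not the blog's exact benchmark): on compute capability 1.x, any nonzero offset misaligns each warp's reads and the measured bandwidth drops sharply, while later hardware largely hides it through caching.

    __global__ void offsetCopy(float *out, const float *in,
                               int n, int offset)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n - offset) {
            /* With offset != 0, a warp's reads no longer start on a
               64/128-byte segment boundary, so CC 1.x hardware splits
               them into several smaller memory transactions. */
            out[i] = in[i + offset];
        }
    }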
Since the performance depends on many factors, you may have to post your device model and the kernel code as well to allow further investigation.