 

Un-coalesced global memory access caused by indirect access in CUDA

Tags:

cuda

gpgpu

gpu

My CUDA program is suffering from un-coalesced global memory accesses. Although the idx-th thread only deals with the idx-th cell of an array, there are many indirect memory accesses, as shown below.

int idx = blockDim.x * blockIdx.x + threadIdx.x;

.... = FF[m_front[m_fside[idx]]];

For m_fside[idx], the accesses are coalesced, but what we actually need is FF[m_front[m_fside[idx]]], which is a two-level indirect access.

I tried to find some pattern in the data in m_front and m_fside that would let me turn this into a direct, sequential access, but found that they are almost 'random'.

Is there a possible way to tackle this?

asked Feb 28 '13 by thierry


People also ask

What is coalesced memory access?

Coalesced memory access, or memory coalescing, refers to combining multiple memory accesses into a single transaction. On the K20 GPUs on Stampede, each successive 128 bytes of memory (32 single-precision words) can be accessed by a warp (32 consecutive threads) in a single transaction.
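For illustration, a minimal sketch (hypothetical kernel and parameter names) of a fully coalesced load: each of the 32 threads in a warp reads one consecutive 4-byte float, so the warp covers exactly one aligned 128-byte segment in a single transaction.

__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx];   // thread k of a warp reads 4 bytes at offset 4*k past the warp's base address
}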

What is global memory in CUDA?

Global memory can be considered the main memory space of the GPU in CUDA. It is allocated and managed by the host, and it is accessible to both the host and the GPU; for this reason, the global memory space can be used to exchange data between the two.
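As a sketch of that exchange (hypothetical names, error checking omitted), the usual round trip allocates global memory from the host, copies data in, runs kernels on it, and copies results back:

#include <cuda_runtime.h>
#include <stdlib.h>

int main()
{
    const int n = 1 << 20;
    float *h_data = (float *)calloc(n, sizeof(float));   // host buffer
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));              // allocate global memory on the device
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
    /* ... launch kernels that read and write d_data ... */
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_data);
    free(h_data);
    return 0;
}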

What is coalesce in CUDA?

Memory coalescing is a technique that allows optimal usage of the global memory bandwidth: when parallel threads running the same instruction access consecutive locations in global memory, the most favorable access pattern is achieved.
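As a counter-example sketch (hypothetical kernel name): with a stride of 32 floats, the 32 threads of a warp touch 32 different 128-byte segments, so what was a single coalesced transaction above can become up to 32 separate transactions.

__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx * stride < n)
        out[idx] = in[idx * stride];   // addresses within a warp are no longer consecutive
}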

How do you use constant memory in CUDA?

A variable allocated in constant memory needs to be declared in CUDA by using the special __constant__ qualifier, and it must be a global variable, i.e. it must be declared in the scope that contains the kernel, not inside the kernel itself.
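A minimal sketch of the pattern (hypothetical names): the variable lives at file scope, and the host fills it with cudaMemcpyToSymbol rather than cudaMemcpy.

__constant__ float c_coeff[16];   // declared outside any kernel, in constant memory

__global__ void scale(const float *in, float *out, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)
        out[idx] = c_coeff[idx % 16] * in[idx];   // read served by the constant cache
}

// Host side, before the launch:
//     cudaMemcpyToSymbol(c_coeff, h_coeff, 16 * sizeof(float));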


1 Answer

Accelerating global memory random access: Invalidating the L1 cache line

Fermi and Kepler architectures support two types of loads from global memory. Full caching is the default mode: it attempts to hit in L1, then L2, then GMEM, and the load granularity is a 128-byte line. L2-only attempts to hit in L2, then GMEM, and the load granularity is 32 bytes. For certain random access patterns, memory efficiency can be increased by invalidating L1 and exploiting the lower granularity of L2. This can be done by passing the -Xptxas -dlcm=cg option to nvcc.
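For example, assuming a source file named kernel.cu (a hypothetical name), the compile line would look like:

nvcc -Xptxas -dlcm=cg -o app kernel.cu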

General guidelines for accelerating global memory access: disabling ECC support

Fermi and Kepler GPUs support Error Correcting Code (ECC), and ECC is enabled by default. ECC reduces peak memory bandwidth and is intended to enhance data integrity in applications like medical imaging and large-scale cluster computing. If not needed, it can be disabled for improved performance using the nvidia-smi utility on Linux, or via the Control Panel on Microsoft Windows systems. Note that toggling ECC on or off requires a reboot to take effect.
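On Linux this amounts to the following nvidia-smi invocations (run as root; add -i <gpu id> to target a single device):

nvidia-smi -e 0    # disable ECC; takes effect after the next reboot
nvidia-smi -e 1    # re-enable ECC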

General guidelines for accelerating global memory access on Kepler: using read-only data cache

Kepler features a 48KB cache for data that is known to be read-only for the duration of the function. Use of the read-only path is beneficial because it offloads the Shared/L1 cache path and it supports full-speed unaligned memory access. Use of the read-only path can be managed automatically by the compiler (using the const __restrict__ qualifiers) or explicitly by the programmer (using the __ldg() intrinsic).
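Applied to the access pattern in the question, a sketch (hypothetical kernel signature) could look like this: qualifying the pointers with const and __restrict__ lets the compiler route the loads through the read-only cache on its own, while __ldg() forces it explicitly on compute capability 3.5+.

__global__ void gather(const float * __restrict__ FF,
                       const int   * __restrict__ m_front,
                       const int   * __restrict__ m_fside,
                       float *out, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)
        out[idx] = FF[m_front[m_fside[idx]]];   // eligible for the read-only data path

    // Explicit alternative via the intrinsic:
    //     out[idx] = __ldg(&FF[__ldg(&m_front[__ldg(&m_fside[idx])])]);
}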

answered Nov 15 '22 by Vitality