Un-coalesced global memory access caused by indirect access in CUDA

My CUDA program suffers from un-coalesced global memory accesses. Although the idx-th thread only deals with the [idx]-th cell of an array, it makes many indirect memory accesses, as shown below.

int idx=blockDim.x*blockIdx.x+threadIdx.x;

.... = FF[m_front[m_fside[idx]]];

The accesses to m_fside[idx] are coalesced, but what we actually need is FF[m_front[m_fside[idx]]], which is a two-level indirect access.

I tried to find a pattern in the data in m_front or m_fside that would turn this into a direct sequential access, but the values turned out to be almost 'random'.

Is there a possible way to tackle this?

asked Feb 28 '13 05:02

thierry

1 Answer

Accelerating global memory random access: bypassing the L1 cache

Fermi and Kepler architectures support two types of loads from global memory. Full caching is the default mode: a load attempts to hit in L1, then L2, then GMEM, and the load granularity is a 128-byte line. An L2-only load attempts to hit in L2, then GMEM, and the load granularity is 32 bytes. For certain random access patterns, memory efficiency can be increased by bypassing L1 and exploiting the finer granularity of L2: a thread that needs only 4 bytes then pulls in a 32-byte segment rather than a 128-byte line. This can be done by passing the -Xptxas -dlcm=cg option to nvcc.
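
The compiler option applies to every global load in the compilation unit. As a per-load variant of the same idea, the PTX cache operator .cg can be requested through inline assembly. The following is only a minimal sketch, assuming a 64-bit build and float data; the names load_cg, gather_cg and run_gather_cg are made up for illustration, error checking is omitted, and whether this helps at all depends on the GPU and the access pattern, so it should be measured.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: load one float with the PTX .cg cache operator
// (cache in L2 only). Assumes p points to global memory and that pointers
// are 64-bit (the "l" constraint).
__device__ __forceinline__ float load_cg(const float* p)
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Two-level gather modeled on the question: out[idx] = FF[m_front[m_fside[idx]]].
__global__ void gather_cg(const float* FF, const int* m_front,
                          const int* m_fside, float* out, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n) {
        int j = m_front[m_fside[idx]];  // index loads use the default caching mode
        out[idx] = load_cg(FF + j);     // scattered load goes through L2 only
    }
}

// Hypothetical host driver (no error checking; assumes fside is non-empty).
std::vector<float> run_gather_cg(const std::vector<float>& FF,
                                 const std::vector<int>& front,
                                 const std::vector<int>& fside)
{
    int n = static_cast<int>(fside.size());
    float *dFF, *dOut;
    int *dFront, *dFside;
    cudaMalloc(&dFF, FF.size() * sizeof(float));
    cudaMalloc(&dFront, front.size() * sizeof(int));
    cudaMalloc(&dFside, fside.size() * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dFF, FF.data(), FF.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dFront, front.data(), front.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dFside, fside.data(), fside.size() * sizeof(int), cudaMemcpyHostToDevice);

    gather_cg<<<(n + 255) / 256, 256>>>(dFF, dFront, dFside, dOut, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dFF); cudaFree(dFront); cudaFree(dFside); cudaFree(dOut);
    return out;
}
```

Compared with -dlcm=cg, this leaves the coalesced index loads on the default path and changes only the scattered load from FF.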

General guidelines for accelerating global memory access: disabling ECC support

Fermi and Kepler GPUs support Error Correcting Code (ECC), and ECC is enabled by default. ECC reduces peak memory bandwidth; it is needed to protect data integrity in applications such as medical imaging and large-scale cluster computing. If ECC is not needed, it can be disabled for improved performance using the nvidia-smi utility on Linux, or via the Control Panel on Microsoft Windows systems. Note that toggling ECC on or off requires a reboot to take effect.
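
Before changing the setting, it can be useful to check from within the program whether ECC is actually on. The sketch below reads the ECCEnabled field of cudaDeviceProp through the runtime API; the function names ecc_enabled and print_ecc_status are illustrative, not part of any library.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Returns 1 if ECC is enabled on the given device, 0 if it is disabled,
// and -1 if the device properties could not be queried.
int ecc_enabled(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.ECCEnabled ? 1 : 0;
}

// Prints the ECC state of every visible device.
void print_ecc_status()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return;
    }
    for (int d = 0; d < count; ++d) {
        int state = ecc_enabled(d);
        printf("device %d: ECC %s\n", d,
               state == 1 ? "enabled" : (state == 0 ? "disabled" : "unknown"));
    }
}
```

GPUs without ECC support simply report it as disabled, so a 0 here does not by itself mean that ECC was switched off by hand.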

General guidelines for accelerating global memory access on Kepler: using read-only data cache

Kepler features a 48KB cache for data that is known to be read-only for the duration of the function. Using the read-only path is beneficial because it offloads the shared/L1 cache path and supports full-speed unaligned memory access. The read-only path can be selected automatically by the compiler (qualify the pointers with const and __restrict__) or explicitly by the programmer (use the __ldg() intrinsic, which requires compute capability 3.5 or higher).
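
Applied to the two-level gather from the question, this could look as follows. It is a sketch under assumptions: FF is taken to hold float values and the index arrays int values (the question does not state the types), none of the arrays may be written while the kernel runs, and the names gather_ldg and run_gather_ldg are invented for the example. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// const + __restrict__ tell the compiler that the inputs are read-only and
// not aliased; __ldg() requests the read-only data cache explicitly.
// __ldg() requires compute capability 3.5 or higher.
__global__ void gather_ldg(const float* __restrict__ FF,
                           const int*   __restrict__ m_front,
                           const int*   __restrict__ m_fside,
                           float*       __restrict__ out,
                           int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n) {
        int side  = m_fside[idx];            // coalesced load
        int front = __ldg(&m_front[side]);   // first scattered load
        out[idx]  = __ldg(&FF[front]);       // second scattered load
    }
}

// Hypothetical host driver (no error checking; assumes fside is non-empty).
std::vector<float> run_gather_ldg(const std::vector<float>& FF,
                                  const std::vector<int>& front,
                                  const std::vector<int>& fside)
{
    int n = static_cast<int>(fside.size());
    float *dFF, *dOut;
    int *dFront, *dFside;
    cudaMalloc(&dFF, FF.size() * sizeof(float));
    cudaMalloc(&dFront, front.size() * sizeof(int));
    cudaMalloc(&dFside, fside.size() * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dFF, FF.data(), FF.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dFront, front.data(), front.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dFside, fside.data(), fside.size() * sizeof(int), cudaMemcpyHostToDevice);

    gather_ldg<<<(n + 255) / 256, 256>>>(dFF, dFront, dFside, dOut, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dFF); cudaFree(dFront); cudaFree(dFside); cudaFree(dOut);
    return out;
}
```

The read-only cache does not make the accesses coalesced; it only reduces the cost of scattered reads, so sorting or reordering the work by index may still be worth trying if the data allows it.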

answered Nov 15 '22 08:11

Vitality

Tags: cuda, gpgpu, gpu
