When using a C or C++ compiler, passing the -O3
switch makes execution faster. Is there something equivalent in CUDA?
I am compiling my code with the command nvcc filename.cu
and then executing ./a.out
.
Warning: compiling with nvcc -O3 filename.cu
will pass the -O3 option to host code only.
In order to optimize CUDA kernel code, you must pass optimization flags to the PTX compiler. For example:
nvcc -Xptxas -O3,-v filename.cu
will ask for optimization level 3 for the device code (this is the default), while -v
asks for a verbose compilation, which reports very useful information (such as per-kernel register and memory usage) that we can consider for further optimization techniques (more on this later).
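As a minimal file to try those flags on (the kernel name and sizes here are illustrative, not from the question), you could compile the following with nvcc -Xptxas -O3,-v saxpy.cu:

```cuda
// saxpy.cu -- a minimal kernel to experiment with ptxas flags.
// With -Xptxas -v, ptxas reports register and memory usage per kernel.
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```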
Another speed-optimization flag available for the nvcc compiler is -use_fast_math
, which will use intrinsics at the expense of floating-point precision (see Options for Steering GPU Code Generation in the nvcc documentation).
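What -use_fast_math does, in effect, is substitute faster but less accurate device intrinsics for certain standard math calls. A sketch of the same substitution written out explicitly (the intrinsics named here, __sinf, __cosf, and __fdividef, are real CUDA device functions; the kernel itself is illustrative):

```cuda
// Without -use_fast_math: full-precision single-precision library calls.
__global__ void tanPrecise(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = sinf(x[i]) / cosf(x[i]);
}

// Roughly what -use_fast_math turns the above into: hardware-accelerated
// __sinf/__cosf and the approximate division __fdividef. Faster, but with
// reduced accuracy over parts of the input range.
__global__ void tanFast(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __fdividef(__sinf(x[i]), __cosf(x[i]));
}
```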
Anyway, from my experience, such automatic compiler optimization options do not in general achieve great boosts. The best performance can be achieved through explicit coding optimizations, such as:

- Limiting register usage with the -maxrregcount=N
option. The fewer registers a kernel requires, the more blocks are eligible to run concurrently (until register spilling takes over).
- Placing #pragma unroll N
before any independent loop, if any, inside your CUDA kernel. N can be 2, 3, or 4. Best results are obtained when you reach a good balance between register pressure and the achieved unrolling level. This approach falls under the ILP technique, after all.
- Packing your data, e.g. combining two buffers of float A[N], B[N]
into one buffer of float2 AB[N]
data. This will translate into fewer operations for the load/store units and better bus usage efficiency.

Of course, always, always, always check that your code has coalesced memory accesses to global memory and avoids bank conflicts in shared memory. Use the NVIDIA Visual Profiler to get deeper insight into such issues.