How can I get the CUDA compute capability (version) at compile time via a #define? For example, if I use __ballot and compile with
nvcc -c -gencode arch=compute_20,code=sm_20 \
-gencode arch=compute_13,code=sm_13 \
source.cu
can I get the compute capability version in my code via a #define, so I can choose between the branch of code with __ballot and the one without?
Yes. First, it's best to understand what happens when you use -gencode: NVCC compiles your input device code multiple times, once for each target device architecture. So in your example, NVCC will run compilation stage 1 once for compute_20 and once for compute_13.
When nvcc compiles a .cu file, it defines two preprocessor macros: __CUDACC__ and __CUDA_ARCH__. __CUDACC__ has no value; it is simply defined when nvcc is the compiler and left undefined otherwise. __CUDA_ARCH__ is defined to an integer value representing the SM version being compiled for, e.g. 130 for compute_13, 200 for compute_20, and so on. To quote the NVCC documentation included with the CUDA Toolkit:
The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy. This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
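A minimal sketch of how the two macros are typically used together (the HOST_DEVICE macro and the function name below are illustrative, not part of the quoted documentation):

#ifdef __CUDACC__
// nvcc is driving the compilation (host pass and every device pass).
#define HOST_DEVICE __host__ __device__
#else
// A plain host compiler (e.g. g++ including this header) lands here.
#define HOST_DEVICE
#endif

HOST_DEVICE inline int arch_being_compiled()
{
#if __CUDA_ARCH__ >= 200
    return 200;   // chosen during the compute_20 device pass
#elif __CUDA_ARCH__ >= 130
    return 130;   // chosen during the compute_13 device pass
#else
    return 0;     // host pass: __CUDA_ARCH__ is not defined at all
#endif
}

An undefined macro evaluates to 0 inside #if, which is why the host pass falls through to the last branch; per the documentation quoted above, host code should not otherwise depend on the value of __CUDA_ARCH__.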
So, in your case where you want to use __ballot(), you can do this:
....
#if __CUDA_ARCH__ >= 200
    // Compute capability 2.0+: the warp vote instruction is available.
    // 'predicate' and 'lanemask' come from the surrounding (elided) code.
    unsigned int b = __ballot(predicate);
    int p = __popc(b & lanemask);
#else
    // do something else for earlier architectures
#endif
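For context, here is a self-contained sketch of what that guard can look like inside a complete kernel; the kernel name, the per-warp output layout, and the shared-memory fallback are illustrative assumptions, not part of the original answer.

// Counts, per warp, how many threads see a positive input value.
__global__ void count_positive_per_warp(const int *data, int *warp_counts, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;   // lane index within the warp
    int warp = threadIdx.x >> 5;   // warp index within the block

    int predicate = (tid < n) && (data[tid] > 0);

#if __CUDA_ARCH__ >= 200
    // Fermi (sm_20) and later: one warp vote plus a population count.
    unsigned int ballot = __ballot(predicate);
    if (lane == 0)
        warp_counts[blockIdx.x * (blockDim.x >> 5) + warp] = __popc(ballot);
#else
    // Earlier architectures (e.g. sm_13): reduce the predicate through
    // shared memory instead. Assumes blockDim.x is a multiple of 32 and <= 256.
    __shared__ int votes[256];
    votes[threadIdx.x] = predicate;
    __syncthreads();
    if (lane == 0) {
        int count = 0;
        for (int i = 0; i < 32; ++i)
            count += votes[warp * 32 + i];
        warp_counts[blockIdx.x * (blockDim.x >> 5) + warp] = count;
    }
#endif
}

Because the check happens in the preprocessor, each device compilation pass only ever sees the branch that is valid for its architecture, which is what lets a single source file build cleanly with the nvcc command from the question.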