CUDA code compiled for a higher compute capability will execute perfectly for a long time on a device with a lower compute capability, before silently failing one day in some kernel. I spent half a day chasing an elusive bug, only to realize that the Build Rule had sm_21 while the device (a Tesla C2050) was compute capability 2.0.
Is there any CUDA API code I can add that self-checks whether it is running on a device with a compatible compute capability? I need to compile for and work with devices of many compute capabilities. Is there any other action I can take to ensure such errors do not occur?
In the runtime API, cudaGetDeviceProperties returns two fields, major and minor, which give the compute capability of any given enumerated CUDA device. You can use them to check the compute capability of a GPU before establishing a context on it, to make sure it is the right architecture for what your code does.
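For example, here is a minimal sketch of such a check; the required capability of 2.0 and the exit-on-mismatch policy are my own illustrative choices, not part of the original answer:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int dev = 0;
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Device %d: %s, compute capability %d.%d\n",
           dev, prop.name, prop.major, prop.minor);

    // Refuse to run if the device is older than what the code
    // was built for (2.0 here is an assumed requirement).
    const int reqMajor = 2, reqMinor = 0;
    if (prop.major < reqMajor ||
        (prop.major == reqMajor && prop.minor < reqMinor)) {
        fprintf(stderr, "This program requires compute capability "
                        "%d.%d or higher.\n", reqMajor, reqMinor);
        return 1;
    }
    return 0;
}

Failing fast with a clear message like this turns the silent kernel failure described in the question into an immediate, diagnosable error at startup.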
nvcc can generate an object file containing code for multiple architectures from a single invocation using the -gencode option, for example:
nvcc -c -gencode arch=compute_20,code=sm_20 \
-gencode arch=compute_13,code=sm_13 \
source.cu
would produce an output object file with an embedded fatbinary object containing cubin files for GT200 and GF100 cards. The runtime API will automagically handle architecture detection and try loading suitable device code from the fatbinary object without any extra host code.
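If the machine may hold several GPUs of mixed architectures, the same major and minor fields can drive device selection before any context is created. Below is a sketch under that assumption; the helper name chooseDevice and the required capability of 1.3 are hypothetical:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: return the index of the first device with at
// least the given compute capability, or -1 if none qualifies.
int chooseDevice(int reqMajor, int reqMinor)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return -1;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess)
            continue;
        if (prop.major > reqMajor ||
            (prop.major == reqMajor && prop.minor >= reqMinor))
            return d;
    }
    return -1;
}

int main(void)
{
    int dev = chooseDevice(1, 3);   // e.g. require sm_13 or newer
    if (dev < 0) {
        fprintf(stderr, "No device with the required compute capability.\n");
        return 1;
    }
    cudaSetDevice(dev);
    printf("Using device %d\n", dev);
    return 0;
}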