Suppose I have a system with a single GPU installed, and suppose I've also installed a recent version of CUDA.
I want to determine the compute capability of my GPU. If I could compile code, that would be easy:
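#include <stdio.h>
#include <cuda_runtime_api.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  /* query device 0 */
    /* e.g. prints 75 for compute capability 7.5 */
    printf("%d\n", prop.major * 10 + prop.minor);
}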
But suppose I want to do that without compiling. Can I? I thought nvidia-smi might help me, since it lets you query all sorts of information about devices, but it seems it doesn't let you obtain the compute capability. Maybe there's something else I can do? Maybe something visible via /proc or system logs?
Edit: This is intended to run before a build, on a system which I don't control. So it must have minimal dependencies, run from the command line, and not require root privileges.
The compute capability identifies the features supported by the GPU hardware. Applications use it at run time to determine which hardware features and instructions are available on the GPU device.
You can verify that you have a CUDA-capable GPU through the Display Adapters section in the Windows Device Manager. Here you will find the vendor name and model of your graphics card(s). If you have an NVIDIA card that is listed at http://developer.nvidia.com/cuda-gpus, that GPU is CUDA-capable.
Unfortunately, it looks like the answer at the moment is "No", and that one needs to either compile a program or use a binary compiled elsewhere.
Edit: I have adapted a workaround for this issue: a self-contained bash script which compiles a small built-in C program to determine the compute capability. (It is particularly useful to call from within CMake, but can also be run independently.)
Also, I've filed a feature-requesting bug report at NVIDIA about this.
Here's the script, in a version assuming that nvcc is on your path:
//usr/bin/env nvcc --run "$0" ${1:+--run-args "${@:1}"} ; exit $?
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime_api.h>

int main(int argc, char *argv[])
{
    cudaDeviceProp prop;
    cudaError_t status;
    int device_count;
    int device_index = 0;  // default to the first device

    // An optional first argument selects the device index.
    if (argc > 1) {
        device_index = atoi(argv[1]);
    }

    status = cudaGetDeviceCount(&device_count);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount() failed: %s\n", cudaGetErrorString(status));
        return -1;
    }
    if (device_index < 0 || device_index >= device_count) {
        fprintf(stderr, "Specified device index %d is out of range (the device count on this system is %d)\n", device_index, device_count);
        return -1;
    }
    status = cudaGetDeviceProperties(&prop, device_index);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties() for device %d failed: %s\n", device_index, cudaGetErrorString(status));
        return -1;
    }
    // Print the capability as a two-digit integer, e.g. 75 for 7.5.
    int v = prop.major * 10 + prop.minor;
    printf("%d\n", v);
}
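Since the script prints the capability as a bare integer such as 75, a pre-build step can turn that value into compiler flags. The following is only a minimal shell sketch, not part of the original script: the function name arch_flag_from_cc and the file name get_cuda_cc.sh are hypothetical, and it assumes the value has the major * 10 + minor form printed above.

```shell
#!/bin/sh
# Hypothetical helper: turn the integer printed by the script (e.g. 75)
# into an nvcc -gencode option. Rejects anything that is not all digits.
arch_flag_from_cc() {
    cc=$1
    case "$cc" in
        ''|*[!0-9]*)
            echo "invalid compute capability: '$cc'" >&2
            return 1
            ;;
    esac
    printf '%s\n' "-gencode arch=compute_${cc},code=sm_${cc}"
}

# Example use before a build, assuming the script above was saved as
# ./get_cuda_cc.sh (a hypothetical name) and made executable:
#   cc=$(./get_cuda_cc.sh) && nvcc $(arch_flag_from_cc "$cc") -o app app.cu
```

Validating the value before interpolating it keeps an error message or empty output from the detection step from ending up on the compiler command line.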
You can use the deviceQuery utility included in the CUDA installation:
# change cwd into the utility's source directory
$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
# build deviceQuery utility with make as root
$ sudo make
# run deviceQuery
$ ./deviceQuery | grep Capability
CUDA Capability Major/Minor version number: 7.5
# optionally copy deviceQuery into ~/bin for future use
$ cp ./deviceQuery ~/bin
Full output from deviceQuery with an RTX 2080 Ti is as follows:
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce RTX 2080 Ti"
CUDA Driver Version / Runtime Version 11.2 / 10.2
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 11016 MBytes (11551440896 bytes)
(68) Multiprocessors, ( 64) CUDA Cores/MP: 4352 CUDA Cores
GPU Max Clock rate: 1770 MHz (1.77 GHz)
Memory Clock rate: 7000 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 5767168 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
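If a prebuilt deviceQuery binary is already available, the capability can be pulled out of its output in integer form (75 for 7.5). This is a minimal sketch that assumes the output format shown above, which may differ between CUDA versions; the function name cc_from_devicequery is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read deviceQuery output on stdin and print the compute
# capability with the major and minor digits concatenated ("7.5" -> "75").
# Prints one line per device found in the input; prints nothing if no
# matching line is present.
cc_from_devicequery() {
    sed -n 's/^[[:space:]]*CUDA Capability Major\/Minor version number:[[:space:]]*\([0-9][0-9]*\)\.\([0-9][0-9]*\).*$/\1\2/p'
}

# Example use, assuming deviceQuery was built as shown earlier:
#   ./deviceQuery | cc_from_devicequery
```

On a multi-GPU system this yields one value per device, in the order deviceQuery lists them.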
Thanks.