I know that nvidia-smi -l 1 will give the GPU usage every second (similar to the following). However, I would appreciate an explanation of what Volatile GPU-Util really means. Is it the number of used SMs over total SMs, or the occupancy, or something else?
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K20c          Off  | 0000:03:00.0     Off |                    0 |
    | 30%   41C    P0    53W / 225W |      0MiB /  4742MiB |     96%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K20c          Off  | 0000:43:00.0     Off |                    0 |
    | 36%   49C    P0    95W / 225W |   4516MiB /  4742MiB |     63%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    1      5193    C   python                                        4514MiB |
    +-----------------------------------------------------------------------------+
nvidia-smi ships with the NVIDIA GPU display drivers on Linux, and with 64-bit Windows Server 2008 R2 and Windows 7. It can report query information as XML or as human-readable plain text, to either standard output or a file.
NVIDIA GPUs keep a count of ECC errors. The volatile error counter tracks the errors detected since the driver was last loaded. GPU-Util indicates the percentage of GPU utilization, i.e. the percentage of the sample period during which kernels were running on the GPU.
Take the nvidia-smi output as an example. Let the top-middle value, which is in the format {Used} / {Total} MiB, be called Overall Memory Usage. Let the bottom-right value, which is in the format {Used} MiB, be called GPU Memory Usage.
It is a sampled measurement over a time period. For a given time period, it reports what percentage of time one or more GPU kernel(s) was active (i.e. running).
It doesn't tell you anything about how many SMs were used, or how "busy" the code was, or what it was doing exactly, or in what way it may have been using memory.
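To make that definition concrete, here is a minimal host-only sketch of the idea. It is not how the driver computes the number; the function name gpu_util and the interval representation are hypothetical, chosen only for illustration. It assumes utilization is simply the fraction of the sample window covered by at least one running kernel:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// One kernel execution as a half-open interval [start, end) in microseconds,
// measured from the start of the sample window. (Hypothetical representation.)
using Interval = std::pair<double, double>;

// Fraction of the sample window [0, period_us) during which at least one
// kernel was running. Overlapping kernels are not double-counted, and the
// number of SMs a kernel occupies never enters the calculation.
double gpu_util(std::vector<Interval> kernels, double period_us) {
    assert(period_us > 0.0);
    std::sort(kernels.begin(), kernels.end());  // order by start time
    double busy = 0.0;
    double covered_until = 0.0;  // right edge of the time already counted
    for (const Interval& k : kernels) {
        double start = std::max(k.first, covered_until);
        double end = std::min(k.second, period_us);
        if (end > start) {
            busy += end - start;
            covered_until = end;
        }
    }
    return busy / period_us;
}
```

Under this model, a single-thread kernel that spins for the whole window yields 1.0 (100%), exactly like a kernel that fills every SM, which is why the number says nothing about occupancy.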
The above claim(s) can be verified without too much difficulty using a microbenchmarking-type exercise (see below).
According to the NVIDIA docs, the sample period may be between 1 second and 1/6 second, depending on the product. However, the period shouldn't make much difference to how you interpret the result.
Also, the word "Volatile" does not pertain to this data item in nvidia-smi. You are misreading the output format.
Here's a trivial piece of code that supports my claim:
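    #include <stdio.h>
    #include <unistd.h>
    #include <stdlib.h>

    const long long tdelay=1000000LL;  // kernel duration, in GPU clock cycles
    const int loops = 10000;           // number of kernel launches
    const int hdelay = 1;              // default host delay between launches, in microseconds

    // Busy-wait on the device for tdelay clock cycles.
    __global__ void dkern(){

      long long start = clock64();
      while(clock64() < start+tdelay);
    }

    int main(int argc, char *argv[]){

      int my_delay = hdelay;
      if (argc > 1) my_delay = atoi(argv[1]);  // optional host delay from the command line
      for (int i = 0; i<loops; i++){
        dkern<<<1,1>>>();   // asynchronous launch: one block, one thread
        usleep(my_delay);}  // host sleeps while the kernel runs or the GPU idles
      return 0;
    }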
On my system, when I run the above code with a command line parameter of 100, nvidia-smi will report 99% utilization. When I run with a command line parameter of 1000, nvidia-smi will report ~83% utilization. When I run it with a command line parameter of 10000, nvidia-smi will report ~9% utilization.
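The trend in those numbers is what a simple duty-cycle estimate would predict. The sketch below is a rough model, not a measurement: the function names are hypothetical, and the 1000 MHz clock used in the note after it is an assumed value, not one taken from the system above. Because kernel launches are asynchronous, the GPU stays busy continuously whenever the host issues launches faster than the kernels complete, so the estimate is capped at 1.0:

```cpp
#include <algorithm>
#include <cassert>

// Convert a clock64() spin count to microseconds for a given SM clock.
// The clock rate is an input because it differs from GPU to GPU.
double kernel_duration_us(long long cycles, double sm_clock_mhz) {
    assert(sm_clock_mhz > 0.0);
    return static_cast<double>(cycles) / sm_clock_mhz;
}

// Rough estimate of the reported utilization: the kernel's run time divided
// by the time between successive launches on the host, capped at 1.0 because
// queued launches keep the GPU busy back to back.
// host_interval_us should include the real cost of usleep() plus launch
// overhead, which is usually somewhat more than the requested delay.
double expected_util(double kernel_us, double host_interval_us) {
    assert(kernel_us > 0.0 && host_interval_us > 0.0);
    return std::min(1.0, kernel_us / host_interval_us);
}
```

For instance, with an assumed 1000 MHz clock the 1000000-cycle kernel lasts about 1000 microseconds, so expected_util(1000.0, 100.0) saturates at 1.0 while expected_util(1000.0, 10000.0) comes out near 0.1, in line with the high and low readings above. The exact figures depend on the actual clock rate and on how long usleep really takes.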
The 'Volatile' in the nvidia-smi output isn't part of GPU-Util; it's part of 'Volatile Uncorr. ECC', which shows the number of uncorrected ECC errors that have occurred on the GPU since the last driver load. There's a good writeup of everything in nvidia-smi here:
https://medium.com/analytics-vidhya/explained-output-of-nvidia-smi-utility-fc4fbee3b124