I'm running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I'd like to observe the GPU utilization while training my TensorFlow models. I get an error trying to run 'nvidia-smi'.
ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia
ii  nvidia-346              352.63-0ubuntu0.14.04.1  amd64  Transitional package for nvidia-346
ii  nvidia-346-dev           346.46-0ubuntu1          amd64  NVIDIA binary Xorg driver development files
ii  nvidia-346-uvm           346.96-0ubuntu0.0.1      amd64  Transitional package for nvidia-346
ii  nvidia-352               375.26-0ubuntu1          amd64  Transitional package for nvidia-375
ii  nvidia-375               375.39-0ubuntu0.14.04.1  amd64  NVIDIA binary driver - version 375.39
ii  nvidia-375-dev           375.39-0ubuntu0.14.04.1  amd64  NVIDIA binary Xorg driver development files
ii  nvidia-modprobe          375.26-0ubuntu1          amd64  Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-346    352.63-0ubuntu0.14.04.1  amd64  Transitional package for nvidia-opencl-icd-352
ii  nvidia-opencl-icd-352    375.26-0ubuntu1          amd64  Transitional package for nvidia-opencl-icd-375
ii  nvidia-opencl-icd-375    375.39-0ubuntu0.14.04.1  amd64  NVIDIA OpenCL ICD
ii  nvidia-prime             0.6.2.1                  amd64  Tools to enable NVIDIA's Prime
ii  nvidia-settings          375.26-0ubuntu1          amd64  Tool for configuring the NVIDIA graphics driver

ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)

$ inxi -G
Graphics:  Card-1: Cirrus Logic GD 5446
           Card-2: NVIDIA GK104GL [GRID K520]
           X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

$ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
        Subsystem: XenSource, Inc. Device 0001
        Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
        Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
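For what it's worth, the kernel-side driver that nvidia-smi talks to does not necessarily match what dpkg reports as installed (note the mix of 346.x and 375.39 packages above). Here is a minimal sketch I use to check which NVIDIA kernel module, if any, is actually loaded; it assumes only the standard Linux /proc interfaces, and the file name is just illustrative:

# check_driver.py -- report which NVIDIA kernel module is actually loaded
# (diagnostic sketch; assumes the standard /proc interfaces on Linux)
import os

VERSION_FILE = "/proc/driver/nvidia/version"

if os.path.exists(VERSION_FILE):
    # Present only when an NVIDIA kernel module is loaded; the first line
    # names the module version (e.g. 346.46 vs. the 375.39 userspace tools).
    with open(VERSION_FILE) as f:
        print(f.read())
else:
    print("No NVIDIA kernel module is loaded (no %s)." % VERSION_FILE)

# Also list any nvidia/nouveau modules known to the running kernel.
with open("/proc/modules") as f:
    for line in f:
        name = line.split()[0]
        if name.startswith("nvidia") or "nouveau" in name:
            print("loaded module: " + name)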
I followed these instructions to install CUDA 7 and cuDNN:
$sudo apt-get -q2 update
$sudo apt-get upgrade
$sudo reboot
=======================================================================
Post reboot, update the initramfs by running '$sudo update-initramfs -u'
Now, please edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file.
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Save and exit from the file.
Now install the matching kernel headers and build-essential tools, then update the initramfs and reboot again as below:
$sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$sudo update-initramfs -u
$sudo reboot
========================================================================
Post reboot, run the following commands to install the NVIDIA driver and CUDA toolkit.
$sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$sudo chmod 700 ./cuda_7.0.28_linux.run
$sudo ./cuda_7.0.28_linux.run
$sudo update-initramfs -u
$sudo reboot
========================================================================
Now that the system has come up, verify the installation by running the following.
$sudo modprobe nvidia
$sudo nvidia-smi -q | head
You should see output like that shown in 'nvidia.png'.
Now run the following commands.

$cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$make
$./deviceQuery
However, 'nvidia-smi' still doesn't show GPU activity while Tensorflow is training models:
ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32)
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally

ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P0    38W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
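For completeness, here is how I check whether TensorFlow registers the GPU as a device at all (the import above only shows the CUDA libraries being opened). This is a minimal sketch using the TF 1.x-era API from the session above; if the GRID K520 does not show up, TensorFlow is silently running on the CPU, and nvidia-smi will correctly report no GPU processes:

# gpu_check.py -- does TensorFlow see the GRID K520? (TF 1.x-era API sketch)
import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow registered; a working GPU setup shows a
# device named "/gpu:0" (or "/device:GPU:0") alongside the CPU.
print([d.name for d in device_lib.list_local_devices()])

# Log where each op is placed; with a visible GPU the matmul lands on /gpu:0.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    print(sess.run(tf.matmul(a, a)))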
The error "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running." is often a consequence of a driver update; if you are lucky, simply rebooting the machine will make it work again.
The NVIDIA System Management Interface (nvidia-smi) is a command-line utility built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
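Once the driver is communicating again, GPU utilization can also be polled programmatically through NVML rather than by parsing nvidia-smi output. A minimal sketch, assuming the pynvml bindings (the nvidia-ml-py package) are installed:

# monitor_gpu.py -- poll GPU utilization via NVML (assumes `pip install nvidia-ml-py`)
import time
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetUtilizationRates, nvmlDeviceGetMemoryInfo)

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)  # GPU 0: the GRID K520 on a g2.2xlarge
try:
    for _ in range(10):                 # sample ten times, once per second
        util = nvmlDeviceGetUtilizationRates(handle)
        mem = nvmlDeviceGetMemoryInfo(handle)
        print("gpu %d%%  mem %d/%d MiB" % (
            util.gpu, mem.used // 2**20, mem.total // 2**20))
        time.sleep(1)
finally:
    nvmlShutdown()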
I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with a GTX 950M and Ubuntu 18.04 by disabling Secure Boot Control in the BIOS.
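Before heading into the BIOS, you can confirm whether Secure Boot is actually enabled; a small sketch, assuming mokutil is available (it ships with stock Ubuntu 18.04):

# sb_check.py -- report Secure Boot state (assumes mokutil is installed)
import subprocess

try:
    # mokutil prints e.g. "SecureBoot enabled" or "SecureBoot disabled"
    out = subprocess.check_output(["mokutil", "--sb-state"],
                                  stderr=subprocess.STDOUT)
    print(out.decode().strip())
except (OSError, subprocess.CalledProcessError) as exc:
    print("Could not query Secure Boot state: %s" % exc)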