
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

Tags:

gpu

I'm running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I'd like to observe the GPU utilization while training my TensorFlow models. I get an error trying to run 'nvidia-smi'.

ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia
ii  nvidia-346             352.63-0ubuntu0.14.04.1  amd64  Transitional package for nvidia-346
ii  nvidia-346-dev         346.46-0ubuntu1          amd64  NVIDIA binary Xorg driver development files
ii  nvidia-346-uvm         346.96-0ubuntu0.0.1      amd64  Transitional package for nvidia-346
ii  nvidia-352             375.26-0ubuntu1          amd64  Transitional package for nvidia-375
ii  nvidia-375             375.39-0ubuntu0.14.04.1  amd64  NVIDIA binary driver - version 375.39
ii  nvidia-375-dev         375.39-0ubuntu0.14.04.1  amd64  NVIDIA binary Xorg driver development files
ii  nvidia-modprobe        375.26-0ubuntu1          amd64  Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-346  352.63-0ubuntu0.14.04.1  amd64  Transitional package for nvidia-opencl-icd-352
ii  nvidia-opencl-icd-352  375.26-0ubuntu1          amd64  Transitional package for nvidia-opencl-icd-375
ii  nvidia-opencl-icd-375  375.39-0ubuntu0.14.04.1  amd64  NVIDIA OpenCL ICD
ii  nvidia-prime           0.6.2.1                  amd64  Tools to enable NVIDIA's Prime
ii  nvidia-settings        375.26-0ubuntu1          amd64  Tool for configuring the NVIDIA graphics driver
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$

$ inxi -G
Graphics:  Card-1: Cirrus Logic GD 5446
           Card-2: NVIDIA GK104GL [GRID K520]
           X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

$ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
    Subsystem: XenSource, Inc. Device 0001
    Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
    Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
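In the `lspci -k` output above, the Cirrus device reports "Kernel driver in use: cirrus" while the GRID K520 entry has no such line, which usually means no kernel module is bound to the GPU. A minimal sketch for spotting that condition automatically follows; the function name `unbound_gpus` is made up for illustration, and it only looks at VGA and 3D controllers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lspci -k` output on stdin and print every
# VGA/3D device that has no "Kernel driver in use:" line beneath it.
unbound_gpus() {
  awk '
    # A non-indented line starts a new device record.
    /^[^ \t]/ {
      if (gpu != "" && !bound) print gpu
      gpu = ""; bound = 0
      if ($0 ~ /VGA compatible controller|3D controller/) gpu = $0
      next
    }
    # Indented detail line: remember whether a driver is bound.
    /Kernel driver in use:/ { bound = 1 }
    END { if (gpu != "" && !bound) print gpu }
  '
}

# Usage on a real machine (requires pciutils):
#   lspci -k | unbound_gpus
```

An empty result would suggest every GPU has a driver bound; a printed device line is a hint to check whether the `nvidia` module was built for the running kernel.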

I followed these instructions to install CUDA 7 and cuDNN:

$ sudo apt-get -q2 update
$ sudo apt-get upgrade
$ sudo reboot

=======================================================================

After the reboot, update the initramfs by running 'sudo update-initramfs -u'.

Now, please edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file.

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
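The manual edit described above can also be scripted. The following is a sketch only: the function name `append_nouveau_blacklist` is hypothetical, and it assumes the target file ends with a newline. It appends each of the five lines unless an identical line is already present, so running it twice does not duplicate entries.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the nouveau blacklist lines to a modprobe
# config file, skipping any line that is already present.
append_nouveau_blacklist() {
  conf="$1"
  touch "$conf"
  while IFS= read -r line; do
    # -x: match the whole line, -F: fixed string, -q: no output
    grep -qxF -- "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
  done <<'EOF'
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF
}

# Usage (as root, since the real file is owned by root):
#   append_nouveau_blacklist /etc/modprobe.d/blacklist.conf
```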

Save and exit from the file.

Now install the build essential tools and update the initramfs and reboot again as below:

$ sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$ sudo update-initramfs -u
$ sudo reboot

========================================================================

Post reboot, run the following commands to install Nvidia.

$ sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$ sudo chmod 700 ./cuda_7.0.28_linux.run
$ sudo ./cuda_7.0.28_linux.run
$ sudo update-initramfs -u
$ sudo reboot

========================================================================

Now that the system has come up, verify the installation by running the following.

$ sudo modprobe nvidia
$ sudo nvidia-smi -q | head

You should see the output like 'nvidia.png'.

Now run the following commands.

$ cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

However, 'nvidia-smi' still doesn't show any GPU activity while TensorFlow is training models:

ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32)
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally

ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P0    38W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
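A single `nvidia-smi` call is only a snapshot, so it can easily miss activity. To watch utilization over time while a training job runs, one option is to poll with the query flags of `nvidia-smi`. This is a sketch, not a diagnosis of the output above: the wrapper name `gpu_watch` is hypothetical, and it assumes a working driver.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print GPU utilization and memory use as CSV every
# N seconds (default 1) until interrupted with Ctrl-C.
gpu_watch() {
  interval="${1:-1}"
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found in PATH" >&2
    return 1
  fi
  nvidia-smi \
    --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
    --format=csv \
    -l "$interval"
}

# Usage: start training in one terminal, then in another run
#   gpu_watch 2
```

If utilization stays at 0% and the process list stays empty during training, the job is probably running on the CPU.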
dbl001 asked Mar 23 '17 18:03


People also ask

How do you fix NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver?

The full message reads: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running." If you are lucky, the error is only a consequence of a driver update, and rebooting the machine will make it work again.

What is NVIDIA-SMI?

The NVIDIA System Management Interface (nvidia-smi) is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.


1 Answer

I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop (GTX 950M, Ubuntu 18.04) by disabling Secure Boot Control in the BIOS.
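One plausible reason this helps: with Secure Boot enforced, the kernel can refuse to load modules that are not signed with an enrolled key, and that can include a locally built NVIDIA module. Before changing firmware settings, the current state can be checked with `mokutil --sb-state`. The sketch below wraps that check; the function name `sb_state` is hypothetical, and it assumes `mokutil` is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the text printed by `mokutil --sb-state`.
sb_state() {
  case "$1" in
    *"SecureBoot enabled"*)  echo enabled ;;
    *"SecureBoot disabled"*) echo disabled ;;
    *)                       echo unknown ;;
  esac
}

# Only query the firmware when mokutil is installed.
if command -v mokutil >/dev/null 2>&1; then
  state=$(sb_state "$(mokutil --sb-state 2>&1)")
  echo "Secure Boot: $state"
  # If the nvidia module is absent from lsmod while Secure Boot is on,
  # a rejected unsigned module is one possible explanation.
  if [ "$state" = enabled ] && ! lsmod | grep -q '^nvidia'; then
    echo "nvidia kernel module is not loaded"
  fi
fi
```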

nuicca answered Sep 22 '22 18:09