I'm running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I'd like to observe the GPU utilization while training my TensorFlow models. I get an error trying to run 'nvidia-smi'.
ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia
ii  nvidia-346              352.63-0ubuntu0.14.04.1  amd64  Transitional package for nvidia-346
ii  nvidia-346-dev           346.46-0ubuntu1          amd64  NVIDIA binary Xorg driver development files
ii  nvidia-346-uvm           346.96-0ubuntu0.0.1      amd64  Transitional package for nvidia-346
ii  nvidia-352               375.26-0ubuntu1          amd64  Transitional package for nvidia-375
ii  nvidia-375               375.39-0ubuntu0.14.04.1  amd64  NVIDIA binary driver - version 375.39
ii  nvidia-375-dev           375.39-0ubuntu0.14.04.1  amd64  NVIDIA binary Xorg driver development files
ii  nvidia-modprobe          375.26-0ubuntu1          amd64  Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-346    352.63-0ubuntu0.14.04.1  amd64  Transitional package for nvidia-opencl-icd-352
ii  nvidia-opencl-icd-352    375.26-0ubuntu1          amd64  Transitional package for nvidia-opencl-icd-375
ii  nvidia-opencl-icd-375    375.39-0ubuntu0.14.04.1  amd64  NVIDIA OpenCL ICD
ii  nvidia-prime             0.6.2.1                  amd64  Tools to enable NVIDIA's Prime
ii  nvidia-settings          375.26-0ubuntu1          amd64  Tool for configuring the NVIDIA graphics driver

ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)

$ inxi -G
Graphics:  Card-1: Cirrus Logic GD 5446
           Card-2: NVIDIA GK104GL [GRID K520]
           X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

$ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
        Subsystem: XenSource, Inc. Device 0001
        Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
        Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
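For what it's worth, the kernel-side driver that nvidia-smi talks to does not necessarily match what dpkg reports as installed (note the mix of 346.x and 375.39 packages above). Here is a minimal sketch I use to check which NVIDIA kernel module, if any, is actually loaded; it assumes only the standard Linux /proc interfaces, and the file name is just illustrative:

# check_driver.py -- report which NVIDIA kernel module is actually loaded
# (diagnostic sketch; assumes the standard /proc interfaces on Linux)
import os

VERSION_FILE = "/proc/driver/nvidia/version"

if os.path.exists(VERSION_FILE):
    # Present only when an NVIDIA kernel module is loaded; the first line
    # names the module version (e.g. 346.46 vs. the 375.39 userspace tools).
    with open(VERSION_FILE) as f:
        print(f.read())
else:
    print("No NVIDIA kernel module is loaded (no %s)." % VERSION_FILE)

# Also list any nvidia/nouveau modules known to the running kernel.
with open("/proc/modules") as f:
    for line in f:
        name = line.split()[0]
        if name.startswith("nvidia") or "nouveau" in name:
            print("loaded module: " + name)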
I followed these instructions to install CUDA 7 and cuDNN:
$sudo apt-get -q2 update
$sudo apt-get upgrade
$sudo reboot
=======================================================================
Post reboot, update the initramfs by running '$sudo update-initramfs -u'
Now, please edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file.
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Save and exit from the file.
Now install the matching kernel headers and build-essential tools, then update the initramfs and reboot again as below:
$sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$sudo update-initramfs -u
$sudo reboot
========================================================================
Post reboot, run the following commands to install the NVIDIA driver and CUDA toolkit.
$sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$sudo chmod 700 ./cuda_7.0.28_linux.run
$sudo ./cuda_7.0.28_linux.run
$sudo update-initramfs -u
$sudo reboot
========================================================================
Now that the system has come up, verify the installation by running the following.
$sudo modprobe nvidia
$sudo nvidia-smi -q | head
You should see output like that shown in 'nvidia.png'.
Now run the following commands.

$cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$make
$./deviceQuery
However, 'nvidia-smi' still doesn't show GPU activity while Tensorflow is training models:
ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32)
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally

ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P0    38W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
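For completeness, here is how I check whether TensorFlow registers the GPU as a device at all (the import above only shows the CUDA libraries being opened). This is a minimal sketch using the TF 1.x-era API from the session above; if the GRID K520 does not show up, TensorFlow is silently running on the CPU, and nvidia-smi will correctly report no GPU processes:

# gpu_check.py -- does TensorFlow see the GRID K520? (TF 1.x-era API sketch)
import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow registered; a working GPU setup shows a
# device named "/gpu:0" (or "/device:GPU:0") alongside the CPU.
print([d.name for d in device_lib.list_local_devices()])

# Log where each op is placed; with a visible GPU the matmul lands on /gpu:0.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    print(sess.run(tf.matmul(a, a)))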
The error "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running." is often a consequence of a driver update; if you are lucky, simply rebooting the machine will make it work again.
The NVIDIA System Management Interface (nvidia-smi) is a command-line utility built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
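Once the driver is communicating again, GPU utilization can also be polled programmatically through NVML rather than by parsing nvidia-smi output. A minimal sketch, assuming the pynvml bindings (the nvidia-ml-py package) are installed:

# monitor_gpu.py -- poll GPU utilization via NVML (assumes `pip install nvidia-ml-py`)
import time
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetUtilizationRates, nvmlDeviceGetMemoryInfo)

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)  # GPU 0: the GRID K520 on a g2.2xlarge
try:
    for _ in range(10):                 # sample ten times, once per second
        util = nvmlDeviceGetUtilizationRates(handle)
        mem = nvmlDeviceGetMemoryInfo(handle)
        print("gpu %d%%  mem %d/%d MiB" % (
            util.gpu, mem.used // 2**20, mem.total // 2**20))
        time.sleep(1)
finally:
    nvmlShutdown()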
I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with a GTX 950M and Ubuntu 18.04 by disabling Secure Boot Control in the BIOS.
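Before heading into the BIOS, you can confirm whether Secure Boot is actually enabled; a small sketch, assuming mokutil is available (it ships with stock Ubuntu 18.04):

# sb_check.py -- report Secure Boot state (assumes mokutil is installed)
import subprocess

try:
    # mokutil prints e.g. "SecureBoot enabled" or "SecureBoot disabled"
    out = subprocess.check_output(["mokutil", "--sb-state"],
                                  stderr=subprocess.STDOUT)
    print(out.decode().strip())
except (OSError, subprocess.CalledProcessError) as exc:
    print("Could not query Secure Boot state: %s" % exc)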