
nvidia-smi process hangs and can't be killed with SIGKILL either

I'm on Ubuntu 14.04, CUDA toolkit 8, driver version 367.48.

When I run the nvidia-smi command, it hangs indefinitely. When I log in again from another shell and try to kill that nvidia-smi process, for example with kill -9 <PID>, it is not killed. If I run nvidia-smi again, I then see both processes running (checked from yet another shell, of course, because the new one gets stuck just as before).

Could this be an issue with the driver? It's not the latest, but it is still quite recent.

bio asked Jan 05 '17 15:01



2 Answers

I solved this problem by running the following at every boot:

sudo nvidia-smi -pm 1

The above command enables persistence mode. This issue has been affecting NVIDIA drivers for over two years, but they don't seem interested in fixing it. It seems to be related to power management: some time after booting into the OS, if the nvidia-persistenced service has the no-persistence-mode option enabled, the GPU goes into a power-saving state, and the nvidia-smi command hangs waiting for something to give it control of the device again.
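As a rough sketch of the "at every boot" part, the command could be wrapped in a small helper that first checks whether nvidia-smi is on the PATH. The function name enable_persistence is hypothetical, and the idea of calling it from a boot script such as /etc/rc.local is an assumption, not something the answer specifies.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative): enable persistence mode
# only when the nvidia-smi tool is actually installed.
enable_persistence() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; skipping" >&2
        return 1
    fi
    # -pm 1 switches persistence mode on; this needs root privileges
    nvidia-smi -pm 1
}
```

Calling enable_persistence as root early in the boot sequence would then have the same effect as typing the command by hand after each reboot.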

lurscher answered Oct 20 '22 20:10


Given your peculiar situation, I would try to reinstall the driver, as bio proposed.

Have you tried sudo kill -9 <PID>? You probably have, but I'm putting it out there anyway. Or perhaps try sudo kill -15 <PID> to terminate it. From what you describe, it seems as if your driver is stuck in a signal 1 (SIGHUP) hangup.

It seems odd that nvidia-smi would hang spontaneously when run, but the underlying issue may be that the driver was not installed correctly or that the command was not run with superuser access.

Have you tried to use:

service nvidia-smi status
pgrep nvidia-smi
ps -aux | grep nvidia-smi

to get its current state?
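A process that ignores SIGKILL is typically in uninterruptible sleep (state D): it is blocked inside a kernel call, and signals are only delivered once that call returns. The following is a minimal, Linux-only sketch for reading that state letter from /proc; the function name proc_state is hypothetical and only meant as an illustration.

```shell
#!/bin/sh
# Print the kernel state letter (R, S, D, Z, T, ...) for a given PID.
# Linux-only: parses the "State:" line of /proc/<pid>/status.
proc_state() {
    while read -r key value rest; do
        if [ "$key" = "State:" ]; then
            echo "$value"
            return 0
        fi
    done < "/proc/$1/status"
    # No State: line found (or the PID does not exist)
    return 1
}

# Example: inspect the current shell itself
proc_state "$$"
```

Running proc_state with the PID of the stuck nvidia-smi process and getting D would suggest the process is waiting inside the driver, which would explain why neither kill -9 nor kill -15 has any effect.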

Anyway, I hope this helps. I would try to uninstall and reinstall the driver, or use sudo apt-get --fix-broken install to try to repair broken packages/drivers.

Cheers!

Some Raspbian Programmer answered Oct 20 '22 21:10