
Stopping and starting a deep learning Google Cloud VM instance causes TensorFlow to stop recognizing the GPU

I am using the pre-built deep learning VM instances offered by Google Cloud, with an NVIDIA Tesla K80 GPU attached. I chose to have TensorFlow 2.5 and CUDA 11.0 installed automatically. When I start the instance, everything works great. I can run:

import tensorflow as tf
tf.config.list_physical_devices()

This returns the CPU, the accelerated CPU, and the GPU. Similarly, if I run tf.test.is_gpu_available(), it returns True.

However, if I log out, stop the instance, and then restart it, the exact same code sees only the CPU, and tf.test.is_gpu_available() returns False. I get an error that suggests the driver initialization is failing:

 E tensorflow/stream_executor/cuda/cuda_driver.cc:355] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

Running nvidia-smi shows that the machine still sees the GPU, but TensorFlow can’t see it.
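
To make the mismatch concrete, the driver's view and TensorFlow's view can be compared from one script. This is only a minimal sketch, not part of the original setup: the helper names (driver_gpu_lines, tensorflow_gpu_names, summarize) are made up for illustration, and it assumes that nvidia-smi is on the PATH and that `nvidia-smi -L` prints one line per GPU beginning with "GPU ". It uses tf.config.list_physical_devices("GPU") instead of tf.test.is_gpu_available(), which is deprecated in TensorFlow 2.x.

```python
import shutil
import subprocess


def driver_gpu_lines():
    """Return the GPU lines printed by `nvidia-smi -L`, or [] if the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    # Assumption: each attached GPU is listed on a line starting with "GPU ".
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


def tensorflow_gpu_names():
    """Return the GPU device names TensorFlow reports, or None if TensorFlow cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


def summarize(driver_gpus, tf_gpus):
    """Turn the two views into a one-line diagnosis (hypothetical wording)."""
    if tf_gpus is None:
        return "TensorFlow is not importable in this environment."
    if not driver_gpus:
        return "nvidia-smi reports no GPU, so the driver itself does not see one."
    if not tf_gpus:
        return f"Driver sees {len(driver_gpus)} GPU(s) but TensorFlow sees none."
    return f"Driver sees {len(driver_gpus)} GPU(s) and TensorFlow sees {len(tf_gpus)}."
```

Calling summarize(driver_gpu_lines(), tensorflow_gpu_names()) after a restart should report that the driver sees a GPU while TensorFlow sees none, if the problem is the one described here.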

Does anyone know what could be causing this? I don’t want to have to reinstall everything when I’m restarting the instance.

Olivia Alcabes asked Jun 24 '21 16:06


2 Answers

Some people (sadly not me) are able to resolve this by setting the following at the beginning of their script/main:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
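
If you try this, the ordering matters: the variable should be set before TensorFlow initializes CUDA, so the safest place is before the first `import tensorflow`. A minimal sketch of a guard for that, assuming a hypothetical helper name restrict_visible_gpus (not a TensorFlow API):

```python
import os
import sys


def restrict_visible_gpus(device_ids="0"):
    """Set CUDA_VISIBLE_DEVICES and report whether it was set early enough.

    Returns True if TensorFlow had not been imported yet, and False if it
    was already loaded, in which case the setting may come too late.
    """
    imported_already = "tensorflow" in sys.modules
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return not imported_already


# Call this at the very top of the script, before importing TensorFlow.
set_in_time = restrict_visible_gpus("0")
```

A False return value only signals that TensorFlow was imported earlier in the process; it does not by itself prove that the GPU is unusable.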

I had to reinstall the CUDA drivers, and from then on it worked even after restarting the instance. You can select your system configuration on NVIDIA's website, and it will give you the commands needed to install CUDA. It also asks whether you want to uninstall the previous CUDA version (yes!). Luckily, this is also very fast.

a-doering answered Oct 21 '22 11:10


I fixed the same issue with the commands below, taken from https://issuetracker.google.com/issues/191612865?pli=1

gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh

chmod +x /tmp/restart_patch.sh

sudo /tmp/restart_patch.sh

sudo service jupyter restart

howard answered Oct 21 '22 12:10