 

GPU is lost during execution of either TensorFlow or Theano code

When training either of two different neural networks, one with TensorFlow and the other with Theano, the execution sometimes freezes after a random amount of time (usually a few hours, occasionally a few minutes), and running "nvidia-smi" gives this message:

"Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU"

I tried to monitor the GPU performance during a 13-hour run, and everything seemed stable (see the attached monitoring plot).
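For anyone hitting the same symptom, one way to catch the exact moment the failure occurs is to poll `nvidia-smi` and watch for the error text. This is a minimal watchdog sketch, not from the original post; it assumes `nvidia-smi` is on the PATH and that a lost GPU produces one of the marker phrases below (as quoted in the question):

```python
import subprocess
import time

# Phrases nvidia-smi prints when a GPU has fallen off the bus
LOST_MARKERS = ("GPU is lost", "Unable to determine the device handle")

def gpu_lost(smi_output):
    """Return True if nvidia-smi output indicates a lost GPU."""
    return any(marker in smi_output for marker in LOST_MARKERS)

def watch(interval=60):
    """Poll nvidia-smi and report as soon as the GPU is lost."""
    while True:
        result = subprocess.run(["nvidia-smi"],
                                capture_output=True, text=True)
        combined = result.stdout + result.stderr
        if result.returncode != 0 or gpu_lost(combined):
            print("GPU lost at", time.strftime("%Y-%m-%d %H:%M:%S"))
            break
        time.sleep(interval)
```

Correlating the timestamp printed here with the training log can help tell whether the hang coincides with a specific phase of training (e.g. checkpointing or a data-loading spike).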

I'm working with:

  • Ubuntu 14.04.5 LTS
  • GPUs are Nvidia Titan Xp (this behavior repeats on another GPU on the same machine)
  • CUDA 8.0
  • CuDNN 5.1
  • TensorFlow 1.3
  • Theano 0.8.2

I'm not sure how to approach this problem. Can anyone suggest what could cause this and how to diagnose or fix it?

asked Aug 26 '17 by Mega



1 Answer

I posted this question a while ago, but after an investigation back then that took a few weeks, we managed to find the problem (and a solution). I don't remember all the details now, but I'm posting our main conclusion in case someone finds it useful.

Bottom line: the hardware we had was not strong enough to support high-load GPU-CPU communication. We observed these issues on a rack server with one CPU and four GPU devices; the PCI bus was simply overloaded. The problem was solved by adding a second CPU to the rack server.
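The PCI-bus-overload conclusion above can be sanity-checked before changing hardware: `nvidia-smi` can report each GPU's current PCIe link generation and lane width, and a lane-starved GPU on a single-CPU board often runs at x8 or x4 instead of x16. A small parser sketch for that output (an illustration, assuming your driver supports these query fields):

```python
import csv
import io

def parse_pcie_link(csv_text):
    """Parse the output of
    `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv`
    into a list of (generation, lane_width) tuples, one per GPU."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    links = []
    for row in rows[1:]:  # skip the CSV header row
        generation = int(row[0].strip())
        lane_width = int(row[1].strip())
        links.append((generation, lane_width))
    return links
```

If every GPU reports a full-width link (e.g. generation 3, x16) under load, the bottleneck is more likely elsewhere; narrow links point toward the kind of bus contention described above.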

answered Sep 19 '22 by Mega