When training either of two different neural networks, one with TensorFlow and the other with Theano, the execution sometimes freezes after a random amount of time (it could be a few minutes or a few hours, but mostly a few hours), and running "nvidia-smi" then gives this message:
"Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU"
I monitored the GPU's performance over a 13-hour run, and everything looked stable (a sketch of such a monitoring loop is at the end of this question).
I'm working with:
I'm not sure how to approach this problem. Can anyone please suggest what could cause this and how to diagnose or fix it?
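For reference, a logging loop along these lines can record GPU health during a long run (this is only a sketch, not the exact script used; the nvidia-smi query fields are standard, but the one-minute interval and the output file name are arbitrary choices):

```python
import csv
import subprocess
import time

# Standard nvidia-smi query fields; the interval and file name are arbitrary choices.
FIELDS = "timestamp,temperature.gpu,utilization.gpu,memory.used,power.draw"

with open("gpu_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(FIELDS.split(","))
    while True:
        try:
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=" + FIELDS, "--format=csv,noheader"],
                text=True,
            )
        except subprocess.CalledProcessError:
            # nvidia-smi failing mid-run is itself a strong hint that the GPU was lost.
            writer.writerow(["nvidia-smi failed"])
            f.flush()
            break
        for line in out.strip().splitlines():
            writer.writerow([col.strip() for col in line.split(",")])
        f.flush()
        time.sleep(60)
```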
If a TensorFlow operation has both CPU and GPU implementations, the GPU device is prioritized by default when the operation is assigned. For example, tf.matmul has both CPU and GPU kernels; on a system with devices CPU:0 and GPU:0, the GPU:0 device is selected to run tf.matmul.
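To see where ops actually land, TF 2.x can log device placement; a minimal sketch (the tensor shapes are arbitrary):

```python
import tensorflow as tf

# Print the device that every operation is placed on.
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((2, 3))
b = tf.random.uniform((3, 2))

# With a GPU present, the log should show MatMul executing on /device:GPU:0.
c = tf.matmul(a, b)

# Placement can also be overridden explicitly, e.g. to pin the op to the CPU:
with tf.device("/CPU:0"):
    c_cpu = tf.matmul(a, b)
```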
Hardware requirements. Note: TensorFlow binaries use AVX instructions which may not run on older CPUs. The following GPU-enabled devices are supported: NVIDIA® GPU card with CUDA® architectures 3.5, 5.0, 6.0, 7.0, 7.5, 8.0 and higher.
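A quick way to check whether TensorFlow sees a supported GPU at all, and which compute capability it reports, is something like this sketch (get_device_details is available in recent TF 2.x releases):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if not gpus:
    print("No GPU visible to TensorFlow")
for gpu in gpus:
    # get_device_details returns a dict that typically includes
    # 'device_name' and 'compute_capability' (a (major, minor) tuple).
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get("device_name"), details.get("compute_capability"))
```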
TensorFlow underlies models like these because they rely on a CUDA-capable GPU to run their computationally heavy operations far faster than a CPU can.
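A rough micro-benchmark like the one below (matrix size and repeat count are arbitrary choices) usually makes that gap concrete for a large matmul:

```python
import time
import tensorflow as tf

def time_matmul(device, n=4000, repeats=5):
    """Rough timing of an n x n matmul on the given device (illustrative only)."""
    with tf.device(device):
        a = tf.random.uniform((n, n))
        b = tf.random.uniform((n, n))
        tf.matmul(a, b).numpy()  # warm-up, and forces execution to complete
        start = time.time()
        for _ in range(repeats):
            tf.matmul(a, b).numpy()  # .numpy() synchronizes, so the timing is honest
        return (time.time() - start) / repeats

print("CPU:", time_matmul("/CPU:0"))
if tf.config.list_physical_devices("GPU"):
    print("GPU:", time_matmul("/GPU:0"))
```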
I posted this question a while ago, but after an investigation back then that took a few weeks, we managed to find the problem (and a solution). I don't remember all the details now, but I'm posting our main conclusion in case someone finds it useful.
Bottom line: the hardware we had was not strong enough to support high-load GPU-CPU communication. We observed these issues on a rack server with 1 CPU and 4 GPU devices; there was simply an overload on the PCIe bus. The problem was solved by adding a second CPU to the rack server.
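For anyone hitting something similar: the PCIe link state each GPU is actually running at can be read from nvidia-smi, which is one way to spot a downgraded or saturated link (a sketch; the query fields are standard nvidia-smi ones):

```python
import subprocess

# Query the current vs. maximum PCIe generation and link width per GPU.
# A GPU stuck well below its maximum generation/width under load can point to
# the kind of CPU<->GPU bandwidth bottleneck described above.
FIELDS = "name,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max"
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=" + FIELDS, "--format=csv"],
    text=True,
)
print(out)
```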