When training either of two different neural networks, one with TensorFlow and the other with Theano, the execution sometimes freezes after a random amount of time (it could be a few minutes or a few hours, but mostly a few hours), and running "nvidia-smi" then gives this message:
"Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU"
I monitored the GPU's performance over a 13-hour run, and everything looked stable (a sketch of such a monitoring loop is at the end of this question).
I'm working with:
I'm not sure how to approach this problem. Can anyone please suggest what could cause this and how to diagnose or fix it?
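For reference, a logging loop along these lines can record GPU health during a long run (this is only a sketch, not the exact script used; the nvidia-smi query fields are standard, but the one-minute interval and the output file name are arbitrary choices):

```python
import csv
import subprocess
import time

# Standard nvidia-smi query fields; the interval and file name are arbitrary choices.
FIELDS = "timestamp,temperature.gpu,utilization.gpu,memory.used,power.draw"

with open("gpu_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(FIELDS.split(","))
    while True:
        try:
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=" + FIELDS, "--format=csv,noheader"],
                text=True,
            )
        except subprocess.CalledProcessError:
            # nvidia-smi failing mid-run is itself a strong hint that the GPU was lost.
            writer.writerow(["nvidia-smi failed"])
            f.flush()
            break
        for line in out.strip().splitlines():
            writer.writerow([col.strip() for col in line.split(",")])
        f.flush()
        time.sleep(60)
```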
If a TensorFlow operation has both CPU and GPU implementations, the GPU device is prioritized by default when the operation is assigned. For example, tf.matmul has both CPU and GPU kernels; on a system with devices CPU:0 and GPU:0, the GPU:0 device is selected to run tf.matmul.
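To see where ops actually land, TF 2.x can log device placement; a minimal sketch (the tensor shapes are arbitrary):

```python
import tensorflow as tf

# Print the device that every operation is placed on.
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((2, 3))
b = tf.random.uniform((3, 2))

# With a GPU present, the log should show MatMul executing on /device:GPU:0.
c = tf.matmul(a, b)

# Placement can also be overridden explicitly, e.g. to pin the op to the CPU:
with tf.device("/CPU:0"):
    c_cpu = tf.matmul(a, b)
```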
Hardware requirements. Note: TensorFlow binaries use AVX instructions which may not run on older CPUs. The following GPU-enabled devices are supported: NVIDIA® GPU card with CUDA® architectures 3.5, 5.0, 6.0, 7.0, 7.5, 8.0 and higher.
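A quick way to check whether TensorFlow sees a supported GPU at all, and which compute capability it reports, is something like this sketch (get_device_details is available in recent TF 2.x releases):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if not gpus:
    print("No GPU visible to TensorFlow")
for gpu in gpus:
    # get_device_details returns a dict that typically includes
    # 'device_name' and 'compute_capability' (a (major, minor) tuple).
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get("device_name"), details.get("compute_capability"))
```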
TensorFlow underlies models like these because they rely on a CUDA-capable GPU to run their computationally heavy operations far faster than a CPU can.
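A rough micro-benchmark like the one below (matrix size and repeat count are arbitrary choices) usually makes that gap concrete for a large matmul:

```python
import time
import tensorflow as tf

def time_matmul(device, n=4000, repeats=5):
    """Rough timing of an n x n matmul on the given device (illustrative only)."""
    with tf.device(device):
        a = tf.random.uniform((n, n))
        b = tf.random.uniform((n, n))
        tf.matmul(a, b).numpy()  # warm-up, and forces execution to complete
        start = time.time()
        for _ in range(repeats):
            tf.matmul(a, b).numpy()  # .numpy() synchronizes, so the timing is honest
        return (time.time() - start) / repeats

print("CPU:", time_matmul("/CPU:0"))
if tf.config.list_physical_devices("GPU"):
    print("GPU:", time_matmul("/GPU:0"))
```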
I posted this question a while ago, but after an investigation back then that took a few weeks, we managed to find the problem (and a solution). I don't remember all the details now, but I'm posting our main conclusion in case someone finds it useful.
Bottom line: the hardware we had was not strong enough to support high-load GPU-CPU communication. We observed these issues on a rack server with 1 CPU and 4 GPU devices; there was simply an overload on the PCIe bus. The problem was solved by adding a second CPU to the rack server.
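For anyone hitting something similar: the PCIe link state each GPU is actually running at can be read from nvidia-smi, which is one way to spot a downgraded or saturated link (a sketch; the query fields are standard nvidia-smi ones):

```python
import subprocess

# Query the current vs. maximum PCIe generation and link width per GPU.
# A GPU stuck well below its maximum generation/width under load can point to
# the kind of CPU<->GPU bandwidth bottleneck described above.
FIELDS = "name,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max"
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=" + FIELDS, "--format=csv"],
    text=True,
)
print(out)
```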