System reboots automatically when TensorFlow model is too large

Tags:

tensorflow

I'm using an NVIDIA GTX 1080 GPU (8 GB) to run an Inception model in TensorFlow. When I set batch_size = 16 and image_size = 400 and start the program, my Ubuntu 14.04 machine reboots on its own.

Rfank2019, asked Aug 24 '16 12:08


2 Answers

I tracked the issue down to a faulty power supply. It had enough capacity according to its spec, and limiting GPU power consumption by running "nvidia-smi -pl 150" didn't help at all. It probably couldn't handle bursts in power consumption.
Anyway, after I replaced the "Corsair CX750 Builder Series ATX 80 PLUS" power supply with a "Cooler Master V1000", the issue went away. See the details of my investigation in the TensorFlow GitHub issue.

Pavel Surmenok, answered Oct 23 '22 13:10


Make sure it is not a power supply unit (PSU) problem. I was seeing strange occasional reboots on my development machine, and the reboots became more frequent as I increased the input size (larger batch size, larger NN). It turned out to be a PSU problem. A quick check is to limit GPU power consumption and see whether the behavior goes away. For instance, you can limit power to about 150 watts with this command (you'll need sudo rights):

sudo nvidia-smi -pl 150
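
To see how close the GPU actually gets to that limit while the model is training, something like the following rough sketch may help. It assumes nvidia-smi's CSV query mode is available with your driver; the function names peak_power and sample_power are made up for this example. Sampling once per second will not catch very short power spikes, so a low reading here does not rule out the PSU.

```shell
#!/bin/sh
# Rough sketch: sample GPU power draw and report the highest value seen.
# peak_power and sample_power are illustrative names, not part of nvidia-smi.

# Read one power reading (in watts) per line on stdin and print the largest.
# Non-numeric lines (e.g. "[N/A]") count as 0.
peak_power() {
  awk 'BEGIN { max = 0 } { if ($1 + 0 > max) max = $1 + 0 } END { print max }'
}

# Print the GPU power draw once per second for N seconds (default 30).
# With several GPUs, nvidia-smi prints one line per GPU on each sample.
sample_power() {
  seconds="${1:-30}"
  i=0
  while [ "$i" -lt "$seconds" ]; do
    nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits
    sleep 1
    i=$((i + 1))
  done
}
```

For example, start the training job in one terminal and run "sample_power 60 | peak_power" in another to get the highest draw observed over a minute.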
Sergey, answered Oct 23 '22 11:10