I've been working with a GPU build of TensorFlow 0.9.0 on my university's cluster. When I submit a job, it starts running and prints a message such as:
(stuff that says CUDA found the device...)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:808] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:04:00.0)
However, after this it doesn't actually begin processing anything for a long time; it seems to just hang there for a while. For the record, I'm using ImageNet data formatted as in https://github.com/tensorflow/models/blob/master/inception/inception/data, and I'm creating all of my queues, etc. on the CPU and placing all variables/operations on the GPU.
I have also tried not specifying the CPU/GPU split explicitly and letting soft device placement (allow_soft_placement) handle it, but that results in the same hang.
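For reference, this is roughly how I enable that (a minimal sketch assuming a plain tf.Session rather than my ConvNet wrapper; the TF 0.9-era ConfigProto options below are the relevant ones):

import tensorflow as tf

# Fall back to the CPU when an op has no GPU kernel, and log where every op
# ends up, which at least shows whether graph placement finishes before the hang.
config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)
sess = tf.Session(config=config)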
Edit: I should also mention that this still happens even when working with the raw .JPEG files (i.e. not using the preprocessing above), so I don't think the data pipeline is the issue.
Has anybody else experienced this, and is there any way around it?
Thank you.
Edit: Code snippet
AlexNet = ConvNet(G,'AlexNet',k=k,H=H,W=W,D=D)

with tf.device('/gpu:0'):
    # (assemble AlexNet)
    train_step, cross_entropy = AlexNet.getTrainStep(LR)
    acc = AlexNet.getAccuracyMetric()
    AlexNet.finalizeBuild()

print('file io stuff...')
with tf.device('/cpu:0'):
    # Placeholders for feeding batches pulled from the input pipeline
    image_holder = tf.placeholder(tf.float32, shape=[None, H, W, D])
    label_holder = tf.placeholder(tf.int32)
    if mode == 'local':
        label_batch = tf.one_hot(label_holder, k)
    elif mode == 'sherlock':
        label_batch = tf.one_hot(label_holder, k, 1, 0)
    image_batch = tf.mul(image_holder, 1)

    # ImageNet input pipeline (queues live on the CPU)
    train_dataset = ImagenetData('train')
    val_dataset = ImagenetData('validation')
    train_images, train_labels = image_processing.inputs(train_dataset)
    val_images, val_labels = image_processing.inputs(val_dataset)

#tf.initialize_all_variables()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=AlexNet.session, coord=coord)

print('beginning training')
val_accs = []
losses = []
for itt in range(nitt):
    print(itt)
    # ...Training routine
On some machines the NVIDIA driver takes time to initialize the GPU before each run. Enable persistence mode by running the following command before launching the script:
sudo nvidia-persistenced --persistence-mode
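To confirm the setting took effect (a generic check, not specific to this cluster), nvidia-smi reports it:

nvidia-smi -q | grep -i "Persistence Mode"

On most driver versions, sudo nvidia-smi -pm 1 is an alternative way to enable persistence mode directly.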