TensorFlow startup time?



I've been working with a GPU build of TensorFlow 0.9.0 on my university's cluster. When I submit the job, it begins running and outputs a message such as:

(stuff that says CUDA found the device...)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:808] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:04:00.0)

However, after this it doesn't begin processing anything for a long time; it seems to just hang there for a while. For the record, I'm using ImageNet data formatted as in https://github.com/tensorflow/models/blob/master/inception/inception/data, creating all of my queues, etc. on the CPU, and running all variables/operations on the GPU.

I have tried not explicitly specifying the CPU/GPU split and letting soft device placement (allow_soft_placement) do its thing, but this results in the same hang.

Edit: I should also mention that this still happens even when working with the raw .JPEG files (i.e., not using the processing techniques above), so I don't think the issue lies there.

Has anybody else experienced this, and is there any way around it?

Thank you.

Edit: Code snippet

AlexNet = ConvNet(G,'AlexNet',k=k,H=H,W=W,D=D)

with tf.device('/gpu:0'):
    # (assemble AlexNet)

    train_step,cross_entropy = AlexNet.getTrainStep(LR)
    acc = AlexNet.getAccuracyMetric()

print('file io stuff...')
with tf.device('/cpu:0'):
    image_holder = tf.placeholder(tf.float32, shape=[None, H,W,D])
    label_holder = tf.placeholder(tf.int32)

    if mode == 'local':
        label_batch = tf.one_hot(label_holder,k)
    elif mode =='sherlock':
        label_batch = tf.one_hot(label_holder,k,1,0)

    image_batch = tf.mul(image_holder,1)

    train_dataset = ImagenetData('train')
    val_dataset = ImagenetData('validation')
    train_images, train_labels = image_processing.inputs(train_dataset)
    val_images, val_labels = image_processing.inputs(val_dataset)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=AlexNet.session,coord=coord)

print('beginning training')

val_accs = []
losses = [] 

for itt in range(nitt):
    pass  # ... training routine (elided)
KTF asked Dec 31 '16 17:12


1 Answer

On some machines the NVIDIA driver takes a while to initialize when no other process is holding the GPU. Enable persistence mode before running the script so the driver stays loaded:

sudo nvidia-persistenced --persistence-mode
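
To confirm whether persistence mode is actually on before submitting a job, something like the following sketch may help. It assumes `nvidia-smi` is on the PATH and that the installed driver supports the `persistence_mode` query field; the helper names `parse_persistence_modes` and `query_persistence_modes` are made up for illustration and are not part of any library.

```python
import shutil
import subprocess


def parse_persistence_modes(smi_output):
    """Turn nvidia-smi CSV output into a list of booleans, one per GPU."""
    modes = []
    for line in smi_output.splitlines():
        value = line.strip().lower()
        if value:  # skip blank lines
            modes.append(value == "enabled")
    return modes


def query_persistence_modes():
    """Ask nvidia-smi for each GPU's persistence mode.

    Returns None when nvidia-smi is not on PATH or the query fails.
    Assumption: the driver's nvidia-smi accepts the persistence_mode field.
    """
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=persistence_mode", "--format=csv,noheader"],
            universal_newlines=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    return parse_persistence_modes(output)


if __name__ == "__main__":
    modes = query_persistence_modes()
    if modes is None:
        print("nvidia-smi not available; cannot check persistence mode")
    else:
        for index, enabled in enumerate(modes):
            print("GPU %d persistence mode: %s" % (index, "on" if enabled else "off"))
```

If every GPU reports persistence mode as on and the startup delay remains, the driver wake-up is probably not the cause and the stall lies elsewhere.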
Trideep Rath answered Nov 04 '22 20:11