
MonitoredTrainingSession writes more than one metagraph event per run

When writing checkpoint files using a tf.train.MonitoredTrainingSession it somehow writes multiple metagraphs. What am I doing wrong?

I stripped it down to the following code:

import tensorflow as tf

output_path = "../train/"  # assumed output root; adjust as needed

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test1/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

with tf.train.MonitoredTrainingSession(master = '',
                                       is_chief = True,
                                       checkpoint_dir = None,
                                       hooks = hooks,
                                       save_checkpoint_secs = None,
                                       save_summaries_steps = None,
                                       save_summaries_secs = None) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
            break
        finally:
            pass

Running this will give duplicate metagraphs, as evidenced by the tensorboard warning:

$ tensorboard --logdir ../train/test1/ --port=6006

WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event. Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)
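To double-check this without TensorBoard, the graph events in the event files can be counted directly. A minimal sketch, assuming the event files land under ../train/test1/ckpt/ with the default events.out.tfevents.* naming:

import glob
import tensorflow as tf

# Count graph events per event file; more than one (or a graph_def plus a
# meta_graph_def) is what triggers the TensorBoard warning above.
for events_file in glob.glob("../train/test1/ckpt/events.out.tfevents.*"):
    graph_events = 0
    for event in tf.train.summary_iterator(events_file):
        if event.graph_def or event.meta_graph_def:
            graph_events += 1
    print(events_file, "graph events:", graph_events)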

This is in tensorflow 1.2.0 (I cannot upgrade).

Running the same thing without a monitored session gives the right checkpoint output:

import tensorflow as tf

output_path = "../train/"  # assumed output root, as above

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    for i in range(30):
        gs, _ = sess.run([global_step, train])
        print(gs)
        if i % 10 == 0:
            saver.save(sess, output_path + '/test2/my-model', global_step=gs)
            print("Saved ckpt")

This results in no tensorboard errors:

$ tensorboard --logdir ../train/test2/ --port=6006

Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)

I'd like to fix this as I suspect I'm missing something fundamental, and this error may have some connection to other issues I have in distributed mode. I have to restart tensorboard anytime I want to update the data. Moreover, TensorBoard seems to get really slow over time when it puts out many of these warnings.

There is a related question: tensorflow Found more than one graph event per run. In that case the errors were due to multiple runs (with different parameters) being written to the same output directory. The case here is a single run writing to a clean output directory.

Running the MonitoredTrainingSession version in distributed mode gives the same errors.

Update Oct-12

@Nikhil Kothari suggested using tf.train.MonitoredSession instead of the larger tf.train.MonitoredTrainingSession wrapper, as follows:

import tensorflow as tf

output_path = "../train/"  # assumed output root, as above

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test3/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

chiefsession = tf.train.ChiefSessionCreator(scaffold=None,
                                            master='',
                                            config=None,
                                            checkpoint_dir=None,
                                            checkpoint_filename_with_path=None)
with tf.train.MonitoredSession(session_creator=chiefsession,
                hooks=hooks,
                stop_grace_period_secs=120) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
            break
        finally:
            pass

Unfortunately this still gives the same tensorboard errors:

$ tensorboard --logdir ../train/test3/ --port=6006

WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event. Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)

By the way, each code block is stand-alone; copy-paste it into a Jupyter notebook and you will replicate the problem.

Bastiaan asked Oct 08 '17


1 Answer

I wonder if this is because every node in your cluster is running the same code, declaring itself as a chief, and saving out graphs and checkpoints.

I don't know if the is_chief = True is just illustrative in the post here on Stack Overflow or if that is exactly what you are using... so I'm guessing a bit here.

I personally used MonitoredSession instead of MonitoredTrainingSession and created a list of hooks based on whether the code is running on the master/chief or not. Example: https://github.com/TensorLab/tensorfx/blob/master/src/training/_trainer.py#L94
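A rough sketch of that pattern, adapted to the code in the question (the task_index, master and test4 path below are placeholders, not values from your setup):

import tensorflow as tf

task_index = 0               # placeholder: however you identify the task in your cluster
master = ''                  # placeholder: in-process master for a single-machine run
output_path = "../train/"    # placeholder output root
is_chief = (task_index == 0)

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)

# Only the chief gets a CheckpointSaverHook, so workers never write graph events.
hooks = []
if is_chief:
    saver = tf.train.Saver()
    hooks.append(tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test4/ckpt/",
                                              save_steps=10,
                                              saver=saver))

session_creator = (tf.train.ChiefSessionCreator(master=master)
                   if is_chief
                   else tf.train.WorkerSessionCreator(master=master))

with tf.train.MonitoredSession(session_creator=session_creator, hooks=hooks) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        gs, _ = mon_sess.run([global_step, train])
        print(gs)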

Nikhil Kothari