When writing checkpoint files with a tf.train.MonitoredTrainingSession, multiple metagraphs somehow get written. What am I doing wrong?
I stripped it down to the following code:
import tensorflow as tf

output_path = "../train/"  # inferred from the --logdir used below; adjust as needed

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()

# Save a checkpoint every 10 steps via an explicit hook
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test1/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

# Disable the session's own checkpoint/summary saving; only the hook should save
with tf.train.MonitoredTrainingSession(master='',
                                       is_chief=True,
                                       checkpoint_dir=None,
                                       hooks=hooks,
                                       save_checkpoint_secs=None,
                                       save_summaries_steps=None,
                                       save_summaries_secs=None) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
            break
        finally:
            pass
Running this will give duplicate metagraphs, as evidenced by the tensorboard warning:
$ tensorboard --logdir ../train/test1/ --port=6006
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)
This is in tensorflow 1.2.0 (I cannot upgrade).
Running the same thing without a monitored session gives the right checkpoint output:
import tensorflow as tf

output_path = "../train/"  # adjust as needed

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(30):
        gs, _ = sess.run([global_step, train])
        print(gs)
        if i % 10 == 0:
            # Save a checkpoint every 10 steps with a plain Saver
            saver.save(sess, output_path + '/test2/my-model', global_step=gs)
            print("Saved ckpt")
This results in no tensorboard errors:
$ tensorboard --logdir ../train/test2/ --port=6006
Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)
I'd like to fix this as I suspect I'm missing something fundamental, and this error may have some connection to other issues I have in distributed mode. I have to restart tensorboard anytime I want to update the data. Moreover, TensorBoard seems to get really slow over time when it puts out many of these warnings.
There is a related question: "tensorflow Found more than one graph event per run". In that case the errors were due to multiple runs (with different parameters) written to the same output directory; the case here is a single run to a clean output directory.
Running the MonitoredTrainingSession version in distributed mode gives the same errors.
Update Oct-12
@Nikhil Kothari suggested using tf.train.MonitoredSession instead of the larger tf.train.MonitoredTrainingSession wrapper, as follows:
import tensorflow as tf

output_path = "../train/"  # adjust as needed

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test3/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

chiefsession = tf.train.ChiefSessionCreator(scaffold=None,
                                            master='',
                                            config=None,
                                            checkpoint_dir=None,
                                            checkpoint_filename_with_path=None)

with tf.train.MonitoredSession(session_creator=chiefsession,
                               hooks=hooks,
                               stop_grace_period_secs=120) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
            break
        finally:
            pass
Unfortunately this still gives the same tensorboard errors:
$ tensorboard --logdir ../train/test3/ --port=6006
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)
By the way, each code block is stand-alone; copy-paste it into a Jupyter notebook and you will replicate the problem.
I wonder if this is because every node in your cluster is running the same code, declaring itself as a chief, and saving out graphs and checkpoints.
I don't know if the is_chief = True is just illustrative in the post here on Stack Overflow or whether that is exactly what you are using... so I'm guessing a bit here.
I personally used MonitoredSession instead of MonitoredTrainingSession and created a list of hooks based on whether the code is running on the master/chief or not. Example: https://github.com/TensorLab/tensorfx/blob/master/src/training/_trainer.py#L94
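For what it's worth, a minimal sketch of that chief-only hook pattern (not the exact code from the linked repo; task_index, the test4 directory, and the WorkerSessionCreator fallback are illustrative assumptions) could look like this:
import tensorflow as tf

# Sketch only: task_index would normally come from TF_CONFIG or a command-line flag.
task_index = 0
is_chief = (task_index == 0)
output_path = "../train/"

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()

hooks = []
if is_chief:
    # Only the chief gets a CheckpointSaverHook, so only one worker writes
    # checkpoints (and the accompanying meta graph) to the output directory.
    hooks.append(tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test4/ckpt/",
                                              save_steps=10,
                                              saver=saver))

session_creator = (tf.train.ChiefSessionCreator(master='') if is_chief
                   else tf.train.WorkerSessionCreator(master=''))

with tf.train.MonitoredSession(session_creator=session_creator, hooks=hooks) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        gs, _ = mon_sess.run([global_step, train])
        print(gs)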