When writing checkpoint files with a tf.train.MonitoredTrainingSession, multiple metagraphs somehow get written. What am I doing wrong?
I stripped it down to the following code:
import tensorflow as tf

output_path = "../train/"  # inferred from the --logdir used below; adjust as needed

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()

# Save a checkpoint every 10 steps via an explicit hook
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test1/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

# Disable the session's own checkpoint/summary saving; only the hook should save
with tf.train.MonitoredTrainingSession(master='',
                                       is_chief=True,
                                       checkpoint_dir=None,
                                       hooks=hooks,
                                       save_checkpoint_secs=None,
                                       save_summaries_steps=None,
                                       save_summaries_secs=None) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
            break
        finally:
            pass
Running this will give duplicate metagraphs, as evidenced by the tensorboard warning:
$ tensorboard --logdir ../train/test1/ --port=6006
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)
This is in tensorflow 1.2.0 (I cannot upgrade).
Running the same thing without a monitored session gives the right checkpoint output:
import tensorflow as tf

output_path = "../train/"  # adjust as needed

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(30):
        gs, _ = sess.run([global_step, train])
        print(gs)
        if i % 10 == 0:
            # Save a checkpoint every 10 steps with a plain Saver
            saver.save(sess, output_path + '/test2/my-model', global_step=gs)
            print("Saved ckpt")
This results in no tensorboard errors:
$ tensorboard --logdir ../train/test2/ --port=6006
Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)
I'd like to fix this as I suspect I'm missing something fundamental, and this error may have some connection to other issues I have in distributed mode. I have to restart tensorboard anytime I want to update the data. Moreover, TensorBoard seems to get really slow over time when it puts out many of these warnings.
There is a related question: "tensorflow Found more than one graph event per run". In that case the errors were due to multiple runs (with different parameters) written to the same output directory; the case here is a single run to a clean output directory.
Running the MonitoredTrainingSession version in distributed mode gives the same errors.
Update Oct-12
@Nikhil Kothari suggested using tf.train.MonitoredSession instead of the larger tf.train.MonitoredTrainingSession wrapper, as follows:
import tensorflow as tf

output_path = "../train/"  # adjust as needed

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test3/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

chiefsession = tf.train.ChiefSessionCreator(scaffold=None,
                                            master='',
                                            config=None,
                                            checkpoint_dir=None,
                                            checkpoint_filename_with_path=None)

with tf.train.MonitoredSession(session_creator=chiefsession,
                               hooks=hooks,
                               stop_grace_period_secs=120) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
            break
        finally:
            pass
Unfortunately this still gives the same tensorboard errors:
$ tensorboard --logdir ../train/test3/ --port=6006
WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)
By the way, each code block is stand-alone; copy-paste it into a Jupyter notebook and you will replicate the problem.
I wonder if this is because every node in your cluster is running the same code, declaring itself as a chief, and saving out graphs and checkpoints.
I don't know if the is_chief = True is just illustrative in the post here on Stack Overflow or whether that is exactly what you are using... so I'm guessing a bit here.
I personally used MonitoredSession instead of MonitoredTrainingSession and created a list of hooks based on whether the code is running on the master/chief or not. Example: https://github.com/TensorLab/tensorfx/blob/master/src/training/_trainer.py#L94
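For what it's worth, a minimal sketch of that chief-only hook pattern (not the exact code from the linked repo; task_index, the test4 directory, and the WorkerSessionCreator fallback are illustrative assumptions) could look like this:
import tensorflow as tf

# Sketch only: task_index would normally come from TF_CONFIG or a command-line flag.
task_index = 0
is_chief = (task_index == 0)
output_path = "../train/"

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()

hooks = []
if is_chief:
    # Only the chief gets a CheckpointSaverHook, so only one worker writes
    # checkpoints (and the accompanying meta graph) to the output directory.
    hooks.append(tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test4/ckpt/",
                                              save_steps=10,
                                              saver=saver))

session_creator = (tf.train.ChiefSessionCreator(master='') if is_chief
                   else tf.train.WorkerSessionCreator(master=''))

with tf.train.MonitoredSession(session_creator=session_creator, hooks=hooks) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        gs, _ = mon_sess.run([global_step, train])
        print(gs)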