 

How to use evaluation_loop with train_loop in tf-slim

I'm trying to implement a few different models and train them on CIFAR-10, and I want to use TF-slim to do this. It looks like TF-slim has two main loops that are useful during training: train_loop and evaluation_loop.

My question is: what is the canonical way to use these loops? As a followup: is it possible to use early stopping with train_loop?

Currently I have a model, and my training file train.py looks like this:

import ...
train_log_dir = ...

with tf.device("/cpu:0"):
  images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching( 
                                                                subset='train', ... )
logits, end_points = set_up_model( images )  # Possibly using many GPUs
total_loss = set_up_loss( logits, labels, dataset )
optimizer, global_step = set_up_optimizer( dataset )
train_tensor = slim.learning.create_train_op( 
                                      total_loss, 
                                      optimizer,
                                      global_step=global_step,
                                      clip_gradient_norm=FLAGS.clip_gradient_norm,
                                      summarize_gradients=True)
slim.learning.train(train_tensor, 
                      logdir=train_log_dir,
                      local_init_op=tf.initialize_local_variables(),
                      save_summaries_secs=FLAGS.save_summaries_secs,
                      save_interval_secs=FLAGS.save_interval_secs)

Which is awesome so far - my models all train and converge nicely. I can see this from the events in train_log_dir where all the metrics are going in the right direction. And going in the right direction makes me happy.

But I'd like to check that the metrics are improving on the validation set, too. I don't know of any way to do this with TF-slim that plays nicely with the training loop, so I created a second file called eval.py which contains my evaluation loop.

import ...
train_log_dir = ...

with tf.device("/cpu:0"):
  images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching( 
                                                                subset='validation', ... )
logits, end_points = set_up_model( images )
summary_ops, names_to_values, names_to_updates = create_metrics_and_summary_ops( 
                                                                logits,
                                                                labels,
                                                                dataset.num_classes() )

slim.get_or_create_global_step()
slim.evaluation.evaluation_loop(
      '',
      checkpoint_dir=train_log_dir,
      logdir=train_log_dir,
      num_evals=FLAGS.num_eval_batches,
      eval_op=names_to_updates.values(),
      summary_op=tf.merge_summary(summary_ops),
      eval_interval_secs=FLAGS.eval_interval_secs,
      session_config=config)

Questions:

1) I currently have this model for the evaluation_loop hogging up an entire GPU, but it's rarely being used. I assume there's a better way to allocate resources. It would be pretty nice if I could use the same evaluation_loop to monitor the progress of multiple different models (checkpoints in multiple directories). Is something like this possible?

2) There's no feedback between the evaluation and training. I'm training a ton of models and would love to use early stopping to halt the models which aren't learning or are not converging. Is there a way to do this? Ideally using information from the validation set, but if it has to be just based on the training data that's okay, too.

3) Is my workflow all wrong and I should be structuring it differently? It's not clear from the documentation how to use evaluation in conjunction with training.

Update: As of TF r0.11 I was also getting a segfault when calling slim.evaluation.evaluation_loop. It only happened sometimes (for me, when dispatching jobs to a cluster), inside sv.managed_session, specifically in prepare_or_wait_for_session. It turned out to be the evaluation loop (a second instance of TensorFlow) trying to use the GPU, which had already been requisitioned by the first instance.
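
For reference, one way to avoid this contention is to keep the evaluation process off the GPU entirely. A minimal sketch of the config passed as session_config in eval.py above (the allow_growth variant is an alternative I'm noting here, not something from the original post):

import tensorflow as tf

# Hide the GPUs from the evaluation process so it does not compete
# with the training process for GPU memory.
config = tf.ConfigProto(device_count={'GPU': 0})

# Alternatively, keep the GPU visible but only allocate memory as needed:
# config = tf.ConfigProto()
# config.gpu_options.allow_growth = True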

Asked by Alex Sax, Sep 27 '16



2 Answers

  1. evaluation_loop is meant to be used (as you are currently using it) with a single directory. If you want to be more efficient, you could use slim.evaluation.evaluate_once and add the appropriate logic for swapping directories yourself (see the sketch after this list).

  2. You can do this via the train_step_fn argument of slim.learning.train(...). This argument replaces the default 'train_step' function with a custom one, and your custom training function can return the 'total_loss' and 'should_stop' values however you see fit.

  3. Your workflow looks great; this is probably the most common workflow for training and evaluation with TF-Slim.
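
Following up on point 1, here is a rough sketch of how slim.evaluation.evaluate_once could be looped over several checkpoint directories from a single process. The model_dirs list and the build_eval_graph helper are hypothetical placeholders (they stand in for however the eval graph is set up, e.g. as in eval.py above), and the FLAGS values are the ones from the question:

import time
import tensorflow as tf
slim = tf.contrib.slim

# Hypothetical list of training directories to monitor, one per model.
model_dirs = ['/path/to/model_a', '/path/to/model_b']

while True:
    for model_dir in model_dirs:
        with tf.Graph().as_default():
            # build_eval_graph is a hypothetical helper that builds the eval
            # input pipeline/model for this directory and returns the metric
            # update ops and a merged summary op.
            names_to_updates, summary_op = build_eval_graph(model_dir)
            checkpoint_path = tf.train.latest_checkpoint(model_dir)
            if checkpoint_path is None:
                continue  # no checkpoint written yet for this model
            slim.evaluation.evaluate_once(
                '',
                checkpoint_path=checkpoint_path,
                logdir=model_dir,
                num_evals=FLAGS.num_eval_batches,
                eval_op=list(names_to_updates.values()),
                summary_op=summary_op)
    time.sleep(FLAGS.eval_interval_secs)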

Answered by Nathan Silberman


Thanks to @kmalakoff, this TensorFlow issue describes a brilliant way to validate or test a model during tf.slim training. The main idea is to override the train_step_fn function:

import …
from tensorflow.contrib.slim.python.slim.learning import train_step

...

accuracy_validation = ...
accuracy_test = ...

def train_step_fn(session, *args, **kwargs):
    # Run the regular slim training step first.
    total_loss, should_stop = train_step(session, *args, **kwargs)

    # Every validation_every_n_step steps, evaluate accuracy on the validation set.
    if train_step_fn.step % FLAGS.validation_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_validation)
        print('your validation info')

    # Every test_every_n_step steps, evaluate accuracy on the test set.
    if train_step_fn.step % FLAGS.test_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_test)
        print('your test info')

    train_step_fn.step += 1
    return [total_loss, should_stop]

train_step_fn.step = 0
train_step_fn.accuracy_validation = accuracy_validation
train_step_fn.accuracy_test = accuracy_test

# run training.
slim.learning.train(
    train_op,
    FLAGS.logs_dir,
    train_step_fn=train_step_fn,
    graph=graph,
    number_of_steps=FLAGS.max_steps)
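
The same train_step_fn hook can also be used for the early stopping asked about in question 2: return should_stop=True once the validation metric stops improving. A rough sketch assuming a patience-based criterion (the best-accuracy bookkeeping and the FLAGS.patience flag are hypothetical additions, and accuracy_validation is attached to train_step_fn as above):

def train_step_fn(session, *args, **kwargs):
    total_loss, should_stop = train_step(session, *args, **kwargs)

    if train_step_fn.step % FLAGS.validation_every_n_step == 0:
        accuracy = session.run(train_step_fn.accuracy_validation)
        # Stop if validation accuracy has not improved for `patience`
        # consecutive evaluations.
        if accuracy > train_step_fn.best_accuracy:
            train_step_fn.best_accuracy = accuracy
            train_step_fn.evals_without_improvement = 0
        else:
            train_step_fn.evals_without_improvement += 1
            if train_step_fn.evals_without_improvement >= FLAGS.patience:
                should_stop = True

    train_step_fn.step += 1
    return [total_loss, should_stop]

train_step_fn.step = 0
train_step_fn.best_accuracy = 0.0
train_step_fn.evals_without_improvement = 0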

Answered by era_misa