How to control when to compute evaluation vs training using the Estimator API of tensorflow?

Tags:

tensorflow

As stated in this question:

The tensorflow documentation does not provide any example of how to perform a periodic evaluation of the model on an evaluation set

The accepted answer suggested the use of Experiment (which is deprecated according to this README).

Everything I found online points towards using the train_and_evaluate method. However, I still do not see how to switch between the two processes (train and evaluate). I have tried the following:

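import tensorflow as tf

# model_fn, hparams, model_dir, input_fn, train_file and val_file are defined elsewhere.
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    params=hparams,
    model_dir=model_dir,
    config=tf.estimator.RunConfig(
        save_checkpoints_steps=2000,
        save_summary_steps=100,
        keep_checkpoint_max=5
    )
)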

train_input_fn = lambda: input_fn(
    train_file, #a .tfrecords file
    train=True,
    batch_size=70,
    num_epochs=100
)

eval_input_fn = lambda: input_fn(
    val_file, # another .tfrecords file
    train=False,
    batch_size=70,
    num_epochs=1
)

train_spec = tf.estimator.TrainSpec(
    train_input_fn,
    max_steps=125
)

eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    steps=30,
    name='validation',
    start_delay_secs=150,
    throttle_secs=200
)

tf.logging.info("start experiment...")
tf.estimator.train_and_evaluate(
    estimator,
    train_spec,
    eval_spec
)

Here is what I think my code should be doing:

Train the model for 100 epochs using a batch size of 70; save checkpoints every 2000 batches; save summaries every 100 batches; keep at most 5 checkpoints; after 150 batches on the training set, compute the validation error using 30 batches of validation data

However, I get the following logs:

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /output/model.ckpt.
INFO:tensorflow:loss = 39.55082, step = 1
INFO:tensorflow:global_step/sec: 178.622
INFO:tensorflow:loss = 1.0455043, step = 101 (0.560 sec)
INFO:tensorflow:Saving checkpoints for 150 into /output/model.ckpt.
INFO:tensorflow:Loss for final step: 0.8327793.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /projects/MNIST-GCP/output/model.ckpt-150
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [3/30]
INFO:tensorflow:Evaluation [6/30]
INFO:tensorflow:Evaluation [9/30]
INFO:tensorflow:Evaluation [12/30]
INFO:tensorflow:Evaluation [15/30]
INFO:tensorflow:Evaluation [18/30]
INFO:tensorflow:Evaluation [21/30]
INFO:tensorflow:Evaluation [24/30]
INFO:tensorflow:Evaluation [27/30]
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Finished evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Saving dict for global step 150: accuracy = 0.8552381, global_step =150, loss = 0.95031387

From the logs, it seems that training stops after the first evaluation. What am I missing from the documentation? Could you explain to me how I should have implemented what I think my code is doing?

Additional info: I am running everything on the MNIST dataset, which has 50,000 images in the training set, so (I think) the model should run for num_epochs * 50,000 / batch_size ≈ 71,000 steps.
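As a quick check of that arithmetic, here is a minimal sketch. The helper name `total_training_steps` is hypothetical, and it assumes the examples of all epochs are concatenated before batching, so only the very last batch can be partial.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of steps needed to consume num_epochs passes over the data.

    Assumption: the dataset is repeated first and batched afterwards,
    so at most one batch (the last one) is smaller than batch_size.
    """
    return math.ceil(num_examples * num_epochs / batch_size)


# Values taken from the question: 50,000 images, batch size 70, 100 epochs.
print(total_training_steps(50_000, 70, 100))  # → 71429
```

If batching happens before repeating, every epoch ends with its own partial batch and the total is slightly higher.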

I sincerely appreciate your help!

EDIT: After running some experiments, I realized that max_steps controls the total number of steps of the whole training procedure, not the number of steps taken before computing metrics on the validation set. Reading the documentation of tf.estimator.Estimator.train, I see that it has a steps argument, which works incrementally, whereas max_steps is an absolute cap on the global step. However, tf.estimator.TrainSpec has no steps argument, so I cannot control the number of steps to take before computing metrics on the validation set.
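One way to get step-based control without train_and_evaluate is to alternate calls to Estimator.train and Estimator.evaluate by hand, relying on the incremental behaviour of the steps argument. The function below is only a sketch of that idea: `train_with_periodic_eval` and its parameter names are hypothetical, and it accepts any object that exposes `train(input_fn=..., steps=...)` and `evaluate(input_fn=..., steps=..., name=...)`, which is how tf.estimator.Estimator is called in TensorFlow 1.x.

```python
def train_with_periodic_eval(estimator, train_input_fn, eval_input_fn,
                             total_steps, eval_every, eval_steps):
    """Alternate training and evaluation in fixed-size chunks of steps.

    Returns a list of (steps_done, metrics) pairs, one per evaluation.
    """
    results = []
    steps_done = 0  # counted locally; assumes training starts from scratch
    while steps_done < total_steps:
        chunk = min(eval_every, total_steps - steps_done)
        # `steps` is incremental: train for `chunk` more steps,
        # resuming from the latest checkpoint in model_dir.
        estimator.train(input_fn=train_input_fn, steps=chunk)
        steps_done += chunk
        # evaluate() restores the checkpoint that train() wrote on exit.
        metrics = estimator.evaluate(
            input_fn=eval_input_fn, steps=eval_steps, name="validation")
        results.append((steps_done, metrics))
    return results
```

For the setup in the question this might be called as `train_with_periodic_eval(estimator, train_input_fn, eval_input_fn, total_steps=71429, eval_every=150, eval_steps=30)`. Note that every train() call rebuilds the graph and restarts the training input function, so the input pipeline should shuffle; otherwise each chunk sees the same first batches.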


asked Apr 02 '18 23:04

srcolinas

1 Answer

From my understanding, evaluation rebuilds the model and restores it from the latest checkpoint. In your case, save_checkpoints_steps=2000 means a periodic checkpoint is written only every 2000 steps. You also set max_steps=125, which takes precedence over the amount of data you feed your model.

Therefore, even though you specify a batch size of 70 and 100 epochs, training stops once max_steps is reached, well short of the 2000-step checkpoint interval. Because evaluation depends on a checkpoint, the only evaluation you get is the one that follows the final checkpoint written when training ends, which matches your logs.

Note that evaluation is triggered by a new checkpoint, but it runs at most once every throttle_secs seconds (600 by default), so both save_checkpoints_steps and throttle_secs limit how often you see validation metrics.
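Putting these points together, a configuration that evaluates roughly every 150 steps would tie the checkpoint interval to the desired evaluation interval and raise max_steps to the full length of training. The sketch below only builds the keyword arguments as plain dictionaries; `periodic_eval_settings` is a hypothetical helper, and the throttle_secs value of 1 is an assumption chosen to keep throttling from suppressing evaluations. The exact local-mode behaviour of throttle_secs and start_delay_secs has varied between TensorFlow 1.x releases, so treat the values as a starting point.

```python
def periodic_eval_settings(max_steps, eval_every_steps, eval_steps):
    """Keyword arguments for RunConfig, TrainSpec and EvalSpec (illustrative)."""
    if not 0 < eval_every_steps <= max_steps:
        raise ValueError("eval_every_steps must be in the range 1..max_steps")
    return {
        # Evaluation needs a fresh checkpoint, so save one per interval.
        "run_config": {
            "save_checkpoints_steps": eval_every_steps,
            "save_summary_steps": 100,
            "keep_checkpoint_max": 5,
        },
        # max_steps is the length of the whole training run.
        "train_spec": {"max_steps": max_steps},
        # Small throttle so that (almost) every checkpoint is evaluated.
        "eval_spec": {
            "steps": eval_steps,
            "name": "validation",
            "start_delay_secs": 0,
            "throttle_secs": 1,
        },
    }


# 71,429 steps: 100 epochs of 50,000 examples at batch size 70.
settings = periodic_eval_settings(
    max_steps=71_429, eval_every_steps=150, eval_steps=30)

# With TensorFlow 1.x the dictionaries would be unpacked like this:
#   config = tf.estimator.RunConfig(**settings["run_config"])
#   train_spec = tf.estimator.TrainSpec(train_input_fn, **settings["train_spec"])
#   eval_spec = tf.estimator.EvalSpec(eval_input_fn, **settings["eval_spec"])
```

Keeping the values in one place makes it harder for the checkpoint interval and the intended evaluation interval to drift apart, which is the mismatch behind the behaviour in the question.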


answered Oct 27 '22 15:10

Michael Du
