Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to control when to compute evaluation vs training using the Estimator API of tensorflow?

As stated in this question:

The tensorflow documentation does not provide any example of how to perform a periodic evaluation of the model on an evaluation set

The accepted answer suggested the use of Experiment (which is deprecated according to this README).

All I found on online points towards using the train_and_evaluate method. However, I still do not see how to switch between the two processes (train and evaluate). I have tried the following:

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    params=hparams,
    model_dir=model_dir,
    config = tf.estimator.RunConfig(
        save_checkpoints_steps = 2000,
        save_summary_steps = 100,
        keep_checkpoint_max=5
    )
)

train_input_fn = lambda: input_fn(
    train_file, #a .tfrecords file
    train=True,
    batch_size=70,
    num_epochs=100
)

eval_input_fn = lambda: input_fn(
    val_file, # another .tfrecords file
    train=False,
    batch_size=70,
    num_epochs=1
)
train_spec = tf.estimator.TrainSpec(
    train_input_fn,
    max_steps=125
)    

eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    steps=30,
    name='validation',
    start_delay_secs=150,
    throttle_secs=200
)

tf.logging.info("start experiment...")
tf.estimator.train_and_evaluate(
    estimator,
    train_spec,
    eval_spec
)

Here is what I think my code should be doing:

Train the model for 100 epochs using a batch size of 70; save checkpoints every 2000 batches; save summaries every 100 batches; keep at most 5 checkpoints; after 150 batches on the training set, compute the validation error using 30 batches of validation data

However, I get the following logs:

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /output/model.ckpt.
INFO:tensorflow:loss = 39.55082, step = 1
INFO:tensorflow:global_step/sec: 178.622
INFO:tensorflow:loss = 1.0455043, step = 101 (0.560 sec)
INFO:tensorflow:Saving checkpoints for 150 into /output/model.ckpt.
INFO:tensorflow:Loss for final step: 0.8327793.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /projects/MNIST-GCP/output/model.ckpt-150
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [3/30]
INFO:tensorflow:Evaluation [6/30]
INFO:tensorflow:Evaluation [9/30]
INFO:tensorflow:Evaluation [12/30]
INFO:tensorflow:Evaluation [15/30]
INFO:tensorflow:Evaluation [18/30]
INFO:tensorflow:Evaluation [21/30]
INFO:tensorflow:Evaluation [24/30]
INFO:tensorflow:Evaluation [27/30]
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Finished evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Saving dict for global step 150: accuracy = 0.8552381, global_step =150, loss = 0.95031387

From the logs, it seems that the training stops after the first evaluation step. What am I missing from the documentation? Could you explain me how I should have implemented what I think my code is doing?

Additional info I am running everything using the MNIST dataset, which has 50,000 images in the training set, so (I think) the model should run for *num_epochs*50,000/batch_size ≃ 7,000 steps*

I sincerely appreciate your help!

EDIT: after running experiments I realize that max_steps controls the number of steps of the whole training procedure, not just the amount of steps before computing the metrics on the test set. Reading tf.estimator.Estimator.train, I see it has a steps argument, which works incrementally and is bounded by max_steps; however, tf.estimator.TrainSpec does not have the steps argument, which means I cannot control the number of steps to take before computing metrics on the validation set.

like image 629
srcolinas Avatar asked Apr 02 '18 23:04

srcolinas


People also ask

What kind of estimator model does TensorFlow recommend using for classification?

It is recommended using pre-made Estimators when just getting started. To write a TensorFlow program based on pre-made Estimators, you must perform the following tasks: Create one or more input functions. Define the model's feature columns.

What is Estimator API in TensorFlow?

Estimators simplify sharing implementations between model developers. You can develop a great model with high-level intuitive code, as they usually are easier to use if you need to create models compared to the low-level TensorFlow APIs. Estimators are themselves built on tf. keras.

What are the benefits of using the estimator API?

What r the benefits of using estimator API ? You can train both locally and in a distributed model training environment. It provides a high-level API, simplifying model and development. It automatically saves summaries to TensorBoard.


1 Answers

From my understanding, evaluation happens using a respawned model from the latest checkpoint. In your case, you don't save a checkpoint until 2000 steps. You also indicate max_steps=125, which will take precedence over the data set you feed your model.

Therefore, even though you indicate batch size of 70 and 100 epochs, your model has stopped training at 125 steps, which is well below the checkpoint limit of 2000 steps, which in turn limits evaluation, because evaluation depends on the checkpoint model.

Note by default, evaluation happens with every checkpoint save, assuming you don't set a throttle_secs limit.

like image 108
Michael Du Avatar answered Oct 27 '22 15:10

Michael Du