As stated in this question:
The TensorFlow documentation does not provide any example of how to perform a periodic evaluation of the model on an evaluation set.
The accepted answer suggested the use of Experiment (which is deprecated according to this README).
All I found online points towards using the train_and_evaluate method. However, I still do not see how to switch between the two processes (train and evaluate). I have tried the following:
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    params=hparams,
    model_dir=model_dir,
    config=tf.estimator.RunConfig(
        save_checkpoints_steps=2000,
        save_summary_steps=100,
        keep_checkpoint_max=5
    )
)

train_input_fn = lambda: input_fn(
    train_file,  # a .tfrecords file
    train=True,
    batch_size=70,
    num_epochs=100
)

eval_input_fn = lambda: input_fn(
    val_file,  # another .tfrecords file
    train=False,
    batch_size=70,
    num_epochs=1
)

train_spec = tf.estimator.TrainSpec(
    train_input_fn,
    max_steps=125
)

eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    steps=30,
    name='validation',
    start_delay_secs=150,
    throttle_secs=200
)

tf.logging.info("start experiment...")

tf.estimator.train_and_evaluate(
    estimator,
    train_spec,
    eval_spec
)
Here is what I think my code should be doing:
Train the model for 100 epochs using a batch size of 70; save checkpoints every 2,000 batches; save summaries every 100 batches; keep at most 5 checkpoints; after 150 batches on the training set, compute the validation error using 30 batches of validation data.
However, I get the following logs:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /output/model.ckpt.
INFO:tensorflow:loss = 39.55082, step = 1
INFO:tensorflow:global_step/sec: 178.622
INFO:tensorflow:loss = 1.0455043, step = 101 (0.560 sec)
INFO:tensorflow:Saving checkpoints for 150 into /output/model.ckpt.
INFO:tensorflow:Loss for final step: 0.8327793.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /projects/MNIST-GCP/output/model.ckpt-150
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [3/30]
INFO:tensorflow:Evaluation [6/30]
INFO:tensorflow:Evaluation [9/30]
INFO:tensorflow:Evaluation [12/30]
INFO:tensorflow:Evaluation [15/30]
INFO:tensorflow:Evaluation [18/30]
INFO:tensorflow:Evaluation [21/30]
INFO:tensorflow:Evaluation [24/30]
INFO:tensorflow:Evaluation [27/30]
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Finished evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Saving dict for global step 150: accuracy = 0.8552381, global_step = 150, loss = 0.95031387
From the logs, it seems that training stops after the first evaluation. What am I missing from the documentation? Could you explain to me how I should have implemented what I think my code is doing?
Additional info: I am running everything on the MNIST dataset, which has 50,000 images in the training set, so (I think) the model should run for num_epochs × 50,000 / batch_size ≈ 71,400 steps.
I sincerely appreciate your help!
EDIT: after running more experiments, I realized that max_steps controls the number of steps of the whole training procedure, not just the number of steps before computing the metrics on the evaluation set. Reading tf.estimator.Estimator.train, I see it has a steps argument, which works incrementally and is bounded by max_steps; however, tf.estimator.TrainSpec does not have a steps argument, which means I cannot control the number of steps to take before computing metrics on the validation set.
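If the goal is precise control over how many training steps happen between evaluations, one workaround is to skip train_and_evaluate entirely and alternate the two calls by hand, since Estimator.train does accept steps. Here is a minimal sketch, reusing the estimator, train_input_fn, and eval_input_fn defined above; the num_rounds and steps_per_round names are just illustrative:

num_rounds = 10        # hypothetical: how many train/evaluate cycles to run
steps_per_round = 150  # hypothetical: training steps between evaluations

for _ in range(num_rounds):
    # train() is incremental: each call restores the latest checkpoint
    # and runs `steps` additional steps on top of it
    estimator.train(train_input_fn, steps=steps_per_round)
    metrics = estimator.evaluate(eval_input_fn, steps=30, name='validation')
    tf.logging.info("validation metrics: %s", metrics)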
From my understanding, evaluation runs on a model restored from the latest checkpoint. In your case, you don't save a checkpoint until 2,000 steps. You also set max_steps=125, which takes precedence over the dataset you feed your model.
Therefore, even though you specify a batch size of 70 and 100 epochs, your model stops training at 125 steps, well below the checkpoint interval of 2,000 steps, which in turn limits evaluation, because evaluation depends on the checkpointed model.
Note that by default, evaluation happens with every checkpoint save, assuming you don't set a throttle_secs limit.
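In other words, with train_and_evaluate the checkpoint cadence is the knob: an evaluation can only run once a new checkpoint exists. Here is a minimal sketch of one way to get periodic evaluation, assuming the same model_fn, input functions, and directories as in the question, and assuming (as an example) that you want to evaluate roughly every 2,000 steps over the full ~71,400-step run:

config = tf.estimator.RunConfig(
    save_checkpoints_steps=2000,  # a new checkpoint (and hence an evaluation) every 2000 steps
    save_summary_steps=100,
    keep_checkpoint_max=5
)
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    params=hparams,
    model_dir=model_dir,
    config=config
)
train_spec = tf.estimator.TrainSpec(
    train_input_fn,
    max_steps=71400  # total training budget, not steps between evaluations
)
eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    steps=30,
    name='validation',
    start_delay_secs=0,
    throttle_secs=0  # evaluate as soon as each new checkpoint is written
)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

With this setup, train_and_evaluate restores the latest checkpoint, runs 30 evaluation batches, and then resumes training, which gives the alternating train/evaluate behaviour the question is after.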