I successfully trained an object detection model on custom examples using train.py and eval.py. Running both programs in parallel, I was able to visualize the training and evaluation metrics in TensorBoard during training.
However, both programs have since been moved to the legacy folder, and model_main.py now seems to be the preferred way to run training and evaluation (as a single process). But when I start model_main.py with the following pipeline.config:
train_config {
  batch_size: 1
  num_steps: 40000
  ...
}

eval_config {
  # entire evaluation set
  num_examples: 821
  # for continuous evaluation
  max_evals: 0
  ...
}
I can see in the INFO-level log output of model_main.py that training and evaluation are executed sequentially (as opposed to concurrently, as before with two processes), and that after every single training step a complete evaluation takes place:
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 35932: ...
INFO:tensorflow:Saving checkpoints for 35933 into ...
INFO:tensorflow:Calling model_fn.
...
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-30-10:06:47
...
INFO:tensorflow:Restoring parameters from .../model.ckpt-35933
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [82/821]
...
INFO:tensorflow:Evaluation [738/821]
INFO:tensorflow:Evaluation [820/821]
INFO:tensorflow:Evaluation [821/821]
...
INFO:tensorflow:Finished evaluation at 2018-08-30-10:29:35
INFO:tensorflow:Saving dict for global step 35933: ...
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 35933: .../model.ckpt-35933
INFO:tensorflow:Saving checkpoints for 35934 into .../model.ckpt.
INFO:tensorflow:Calling model_fn.
...
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-30-10:29:56
...
INFO:tensorflow:Restoring parameters from .../model.ckpt-35934
This of course slows training down to the point that almost no progress is made. When I reduce the number of evaluation steps to 1 via model_main.py's command-line parameter --num_eval_steps, training is as fast as before (using train.py and eval.py), but the evaluation metrics become useless (e.g. the DetectionBoxes_Precision/mAP... values become constant, taking values like 1, 0, or even -1). It looks as if they are being computed over and over again for the same single image.
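For illustration, the invocation looks roughly like this (paths shortened as above; flag names as in the model_main.py revision I am using):

python model_main.py \
    --pipeline_config_path=.../pipeline.config \
    --model_dir=.../model_dir \
    --num_eval_steps=1 \
    --alsologtostderr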
So what is the right way to start model_main.py so that it makes reasonably fast training progress while computing the evaluation metrics over the entire evaluation set?
Inside training.py there is a class EvalSpec, which is instantiated in model_lib.py. Its constructor has a parameter called throttle_secs, which sets the minimum interval between consecutive evaluations; it has a default value of 600, and model_lib.py never passes a different value. If you have a specific value in mind, you can simply change the default, but the better practice is of course to pass it as a parameter of model_main.py, which then feeds it to EvalSpec through model_lib.py.
In more detail: define it as another input flag,

flags.DEFINE_integer('throttle_secs', <DEFAULT_VALUE>, 'EXPLANATION')

then read it with throttle_secs=FLAGS.throttle_secs, then change model_lib.create_train_and_eval_specs to also receive throttle_secs, and inside it add it to the call of tf.estimator.EvalSpec.
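Putting the pieces together, the change might look roughly like this (a sketch only; argument lists are abbreviated and variable names such as eval_spec_name and eval_input_fn are illustrative, since the exact signatures depend on your revision of model_main.py and model_lib.py):

# In model_main.py: define the new flag (600 mirrors EvalSpec's default).
flags.DEFINE_integer('throttle_secs', 600,
                     'Minimum number of seconds between evaluations.')

# In model_main.py: forward the flag when building the specs.
train_spec, eval_specs = model_lib.create_train_and_eval_specs(
    # ... existing arguments unchanged ...
    throttle_secs=FLAGS.throttle_secs)

# In model_lib.py, inside create_train_and_eval_specs (after adding
# throttle_secs to its signature): hand the value to the spec.
eval_spec = tf.estimator.EvalSpec(
    name=eval_spec_name,
    input_fn=eval_input_fn,
    steps=None,
    throttle_secs=throttle_secs)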
EDIT: I found out that you can also set eval_interval_secs in the eval_config section of the .config file. If that works (not all options were carried over in the move from eval.py to model_main.py), it is obviously the simpler solution; if not, use the solution above.
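In the pipeline config, that attempt would look like this (300 seconds is just an example value):

eval_config {
  num_examples: 821
  max_evals: 0
  eval_interval_secs: 300
}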
EDIT 2: I tried using eval_interval_secs in eval_config, and it did not work, so you should use the first solution.