 

How to continuously evaluate a TensorFlow object detection model in parallel to training with model_main

I successfully trained an object detection model on custom examples using train.py and eval.py. Running both programs in parallel, I was able to visualize training and evaluation metrics in TensorBoard during training.

However, both scripts have since been moved to the legacy folder, and model_main.py seems to be the preferred way to run training and evaluation (as a single process). When I start model_main.py with the following pipeline.config:

train_config {
  batch_size: 1
  num_steps: 40000
  ...
}
eval_config {
  # entire evaluation set
  num_examples: 821
  # for continuous evaluation
  max_evals: 0
  ...
}

with INFO logging enabled, I see in the output of model_main.py that training and evaluation are executed sequentially (as opposed to concurrently, as before with two processes), and that after every single training step a complete evaluation takes place:

INFO:tensorflow:Saving 'checkpoint_path' summary for global step 35932: ...
INFO:tensorflow:Saving checkpoints for 35933 into ...
INFO:tensorflow:Calling model_fn.
...
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-30-10:06:47
...
INFO:tensorflow:Restoring parameters from .../model.ckpt-35933
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [82/821]
...
INFO:tensorflow:Evaluation [738/821]
INFO:tensorflow:Evaluation [820/821]
INFO:tensorflow:Evaluation [821/821]
...
INFO:tensorflow:Finished evaluation at 2018-08-30-10:29:35
INFO:tensorflow:Saving dict for global step 35933: ...
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 35933: .../model.ckpt-35933
INFO:tensorflow:Saving checkpoints for 35934 into .../model.ckpt.
INFO:tensorflow:Calling model_fn.
...
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-30-10:29:56
...
INFO:tensorflow:Restoring parameters from .../model.ckpt-35934

This of course slows training down so much that almost no progress is made. When I reduce the number of evaluation steps to 1 with model_main's command-line parameter --num_eval_steps, training is as fast as it was before (with train.py and eval.py), but the evaluation metrics become useless (e.g. the DetectionBoxes_Precision/mAP... values become constant at values like 1, 0 or even -1). It seems these metrics are constantly being computed over the same single image only.

So what is the right way to start model_main.py so that it makes reasonably fast progress and, in parallel, computes the evaluation metrics on the entire evaluation set?

asked Aug 30 '18 by Volker Stampa


1 Answer

Inside TensorFlow's training.py there is a class EvalSpec, which is instantiated in model_lib.py. Its constructor has a parameter called throttle_secs, which sets the minimum interval between consecutive evaluations and defaults to 600, and model_lib.py never passes it a different value. If you have a specific value in mind you can simply change the default, but the better practice is of course to pass it as a parameter of model_main.py, which then feeds it into EvalSpec through model_lib.py.

In more detail: add another input flag, flags.DEFINE_integer('throttle_secs', <DEFAULT_VALUE>, 'EXPLANATION'), pass it on with throttle_secs=FLAGS.throttle_secs, then change model_lib.create_train_and_eval_specs to also accept throttle_secs and, inside it, add it to the call of tf.estimator.EvalSpec.
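
A rough sketch of these three edits follows. The flag default of 600 mirrors EvalSpec's own default, and the variable names around the EvalSpec call (eval_spec_name, eval_input_fn, num_eval_steps) are illustrative placeholders; the exact names and the full argument list of create_train_and_eval_specs depend on your object_detection checkout, so treat this as an outline rather than a drop-in patch.

# --- model_main.py: declare the new flag next to the existing flag definitions ---
flags.DEFINE_integer(
    'throttle_secs', 600,
    'Minimum number of seconds between two consecutive evaluations.')

# --- model_main.py: forward the flag when the train/eval specs are built ---
train_spec, eval_specs = model_lib.create_train_and_eval_specs(
    # ... keep the existing arguments unchanged ...
    throttle_secs=FLAGS.throttle_secs)

# --- model_lib.py: add throttle_secs=600 to the signature of
#     create_train_and_eval_specs and hand it to the EvalSpec it builds ---
eval_spec = tf.estimator.EvalSpec(
    name=eval_spec_name,
    input_fn=eval_input_fn,
    steps=num_eval_steps,
    throttle_secs=throttle_secs)

With these changes you could start training with, for example, --throttle_secs=1800 to get a full evaluation roughly every 30 minutes while keeping --num_eval_steps at the size of the evaluation set.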

EDIT: I found out that you can also set eval_interval_secs in the eval_config section of the .config file. If this works (not everything is still supported since the move from eval.py to model_main.py), it is obviously the simpler solution; if not, use the solution above.
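
For reference, this is roughly where the field would go (eval_interval_secs comes from eval.proto; as EDIT2 below notes, it may simply be ignored by model_main.py), with 1800 chosen here only as an example value:

eval_config {
  num_examples: 821
  max_evals: 0
  # desired interval between evaluations in seconds
  eval_interval_secs: 1800
  ...
}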

EDIT2: I tried using eval_interval_secs in eval_config, and it didn't work, so you should use the first solution.

answered Oct 20 '22 by netanel-sam