Can I delete events.out.tfevents.XXXXXXXXXX.computer_name files from the training folder?

I am training a faster_rcnn_inception model for object detection on a custom dataset. The training directory contains a folder called eval_0 and TensorFlow-generated events.out.tfevents.xxxxxx files.

The training directory structure is as follows:

+training_dir
    +eval_0
     -events.out.tfevents.1542309785.instance-1  1.2GB
     -events.out.tfevents.1542367255.instance-1  5.3GB
     -events.out.tfevents.1542369886.instance-1  3.6GB
     -events.out.tfevents.1542624154.instance-1  31MB
     -events.out.tfevents.1543060258.instance-1  19MB
     -events.out.tfevents.1543066775.instance-2  1.6GB
    -events.out.tfevents.1542308099.instance-1  17MB
    -events.out.tfevents.1542308928.instance-1  17MB
    -events.out.tfevents.1542366369.instance-1  17MB
    -events.out.tfevents.1542369000.instance-1  17MB
    -events.out.tfevents.1542623262.instance-1  17MB
    -events.out.tfevents.1543064936.instance-2  17MB
    -events.out.tfevents.1543065796.instance-2  17MB
    -events.out.tfevents.1543065880.instance-2  17MB
    -model.ckpt-96004.data-00000-of-00001
    -model.ckpt-96004.index
    -model.ckpt-96004.meta
    -model.ckpt-96108.data-00000-of-00001
    -model.ckpt-96108.index
    -model.ckpt-96108.meta

As I understand it, the tfevents files in the eval_0 folder are summary files for evaluation, and the tfevents files in training_dir are summary files for training.

I have interrupted the training process several times and resumed from the most recent checkpoint. I also understand that restarting the training process generates new tfevents files.

My questions are as follows:

  • Why do the training tfevents files all have the same size, while the sizes of the eval_0 tfevents files vary?

  • Why does interrupting training generate a new tfevents file in the training folder, while the same is not observed in eval_0?

  • Can I delete all tfevents files in eval_0 except the latest one? Does that affect training or the evaluation history?

Vardhman Patil asked Nov 28 '18 10:11


2 Answers

tfevents files are not essential for training and can be safely removed.

In TensorFlow, tfevents files are created by FileWriters and are generally used to store summary output. Here are some common examples of how tf.summaries are used:

  • storing a description of the tensorflow graph before training starts
  • writing a value of the loss function for every training step
  • storing a histogram of activations or weights for a layer once per epoch
  • storing an example of output image of the network once on every validation
  • storing average precision (or any other metric) for the whole validation set

This information is not essential for training and can therefore be deleted. Still, it can come in handy for debugging or for studying the behavior of the model. TensorBoard is the most common tool for reading and visualizing the data stored in tfevents files. The files can also be read and interpreted manually, using Protocol Buffers and its implementations for Python, C++, and other languages.

tfevents files are written in the TFRecord format, a simple format for storing a sequence of binary records. TensorFlow always appends new events/summaries to the end of the file if the file already exists, which explains why a file grows.
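To make the record framing concrete, here is a minimal sketch in pure Python that walks the records of a TFRecord-framed file without importing TensorFlow. It assumes the standard framing (an 8-byte little-endian length, a 4-byte checksum of the length, the payload, then a 4-byte checksum of the payload) and does not verify the checksums. The function names are made up for illustration.

```python
import struct


def iter_tfrecord_payloads(path):
    """Yield the raw payload bytes of each record in a TFRecord-framed file."""
    with open(path, "rb") as f:
        while True:
            # Header: uint64 payload length + uint32 checksum of that length.
            header = f.read(12)
            if len(header) < 12:
                return  # end of file, or a truncated trailing record
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            footer = f.read(4)  # uint32 checksum of the payload (not verified here)
            if len(payload) < length or len(footer) < 4:
                return  # the file was cut off in the middle of a record
            yield payload


def count_records(path):
    """Return (number of records, total payload bytes) for one file."""
    count = 0
    total = 0
    for payload in iter_tfrecord_payloads(path):
        count += 1
        total += len(payload)
    return count, total
```

In an event file each payload is a serialized Event protocol buffer, so decoding it further would need the Event message definition that ships with TensorFlow or TensorBoard. Counting records this way is enough to see how many events each file has accumulated.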

Because of implementation details in the routines provided with tensorflow/models/research/object_detection, training and evaluation event files behave differently. Namely, the evaluation event file is created using a FileWriter directly, which reuses the latest existing event file in the log_dir whenever one exists. The implementation also collects a large number of summaries at regular intervals, which makes the event file grow during training.

For the training routine, on the other hand, the developers explicitly specify an empty list of summaries when training is done on TPU, which means that the event file is created once and is never written to afterwards. This behaviour can differ when training is performed on non-TPU hardware or when the summarize_gradients option is enabled.

y.selivonchyk answered Nov 15 '22 04:11


TFEvent files are mainly used by TensorBoard. If you open a terminal and start it (i.e. tensorboard --logdir .), what you see comes from these event files.

Of course, you can have multiple "summary writers". In your case, events logged during training go to the root "training_dir", while the ones from the eval phase are put under "eval_0". You would want to do this because TensorBoard plots each folder as a separate group in its charts.
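As a rough illustration of that per-folder grouping, the sketch below walks a log directory and groups event files by the folder they sit in, together with their sizes. It only inspects file names and sizes; the function name and the "tfevents" substring match are assumptions made for this example.

```python
import os
from collections import defaultdict


def event_files_by_folder(logdir):
    """Map each folder (relative to logdir) to its event files and sizes in bytes."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(logdir):
        for name in sorted(filenames):
            if "tfevents" in name:
                folder = os.path.relpath(dirpath, logdir)  # "." for logdir itself
                size = os.path.getsize(os.path.join(dirpath, name))
                groups[folder].append((name, size))
    return dict(groups)
```

For the layout in the question, calling event_files_by_folder("training_dir") would return two keys: "." for the training event files and "eval_0" for the evaluation ones.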

Your training data is different from your eval data, so the event files will of course be different as well.

As for checkpoints, all you need are the model.ckpt* files to restore the weights. The event files are not used at all, so you can safely delete them. Actually, you probably want to start with a clean log folder whenever you start a new training process if you are actually planning to use tensorboard.
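If you would rather keep the most recent event file and drop the older ones, as the question asks for eval_0, a small script along these lines could do it. This is a sketch, assuming the file names follow the events.out.tfevents.<unix_time>.<hostname> pattern shown in the question; the function names are hypothetical, and it defaults to a dry run so nothing is deleted until you ask for it.

```python
from pathlib import Path


def event_timestamp(path):
    """Extract the Unix timestamp from events.out.tfevents.<unix_time>.<hostname>."""
    return int(path.name.split(".")[3])


def prune_event_files(directory, keep=1, dry_run=True):
    """Delete all but the `keep` newest event files directly inside `directory`.

    Returns the names of the files that were (or, in a dry run, would be) deleted.
    Checkpoint files and subdirectories are never touched.
    """
    files = sorted(
        (p for p in Path(directory).glob("events.out.tfevents.*") if p.is_file()),
        key=event_timestamp,
    )
    stale = files[:-keep] if keep > 0 else files
    for p in stale:
        if not dry_run:
            p.unlink()
    return [p.name for p in stale]
```

Calling prune_event_files("training_dir/eval_0") first lists what would be removed; passing dry_run=False performs the deletion.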

The event files are not actually part of the checkpoints; they are log files. As such, whenever a logging method is called, new entries are added. You probably don't see new entries in the eval_0 folder because you stopped the process during the training phase, before the evaluation phase was reached.

Zac R. answered Nov 15 '22 04:11