Tensorflow: Is preprocessing on TFRecord files faster than real-time data preprocessing?

In Tensorflow, it seems that preprocessing can be done either during training time, when the batch is created from raw images (or data), or beforehand, when the images are already static on disk. Given that, theoretically, the preprocessing should take roughly the same time (if done using the same hardware), is there any practical disadvantage in doing data preprocessing (or even data augmentation) before training rather than during training in real time?

As a side question, could data augmentation even be done in Tensorflow if it was not done during training?

asked by kwotsin

1 Answer

Is there any practical disadvantage in doing data preprocessing (or even data augmentation) before training rather than during training in real time?

Yes, there are advantages (+++) and disadvantages (---):

Preprocessing before training:

  • --- preprocessed samples need to be stored: disk space consumption* (1)
  • --- only a "finite" number of samples can be generated
  • +++ no preprocessing runtime during training
  • --- ... but samples always need to be read from storage, i.e. storage (disk) I/O may become the bottleneck
  • --- not flexible: changing the dataset/augmentation requires generating a new augmented dataset
  • +++ for Tensorflow: easily work on numpy.ndarray or other data formats with any high-level image API (OpenCV, PIL, ...) to do augmentation, or even use any other language/tool you like (see the sketch after this list)
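For illustration, here is a minimal sketch of the "before training" route: augmenting with plain numpy/PIL and writing the results to a TFRecord file. The file names, labels, image size, and the flip used as augmentation are hypothetical placeholders, not from the question.

```python
# Minimal sketch: offline augmentation with PIL/numpy, stored as TFRecords.
# All file names and labels below are hypothetical placeholders.
import numpy as np
import tensorflow as tf
from PIL import Image

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

image_paths = ["img0.jpg", "img1.jpg"]  # hypothetical input files
labels = [0, 1]                         # hypothetical labels

with tf.python_io.TFRecordWriter("augmented.tfrecords") as writer:
    for path, label in zip(image_paths, labels):
        img = Image.open(path).resize((224, 224))
        arr = np.asarray(img, dtype=np.uint8)
        # Any ndarray-based augmentation fits here; a horizontal flip as example.
        for variant in (arr, arr[:, ::-1, :]):
            example = tf.train.Example(features=tf.train.Features(feature={
                "image_raw": _bytes_feature(variant.tobytes()),
                "label": _int64_feature(label),
            }))
            writer.write(example.SerializeToString())
```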

Preprocessing during training ("real-time"):

  • +++ an infinite number of samples can be generated (as they are created on the fly)
  • +++ flexible: changing the dataset/augmentation only requires changing code
  • +++ if the dataset fits in memory, no disk I/O is needed for the data after reading it once
  • --- adds runtime to your training* (2)
  • --- for Tensorflow: building the preprocessing as part of the graph requires working with Tensors and restricts the usage of APIs that work on ndarrays or other formats* (3) (see the sketch after this list)
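And a minimal sketch of the "real-time" route, using the queue-based TF 1.x-era input pipeline this answer refers to; the file name, image shape, and batch parameters are assumptions:

```python
# Minimal sketch: queue-based real-time preprocessing (TF 1.x-era API).
# File name, image shape, and batch parameters are assumptions.
import tensorflow as tf

filename_queue = tf.train.string_input_producer(["data.tfrecords"])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(serialized, features={
    "image_raw": tf.FixedLenFeature([], tf.string),
    "label": tf.FixedLenFeature([], tf.int64),
})
image = tf.decode_raw(features["image_raw"], tf.uint8)
image = tf.reshape(image, [224, 224, 3])            # assumed shape
image = tf.image.convert_image_dtype(image, tf.float32)
image = tf.image.random_flip_left_right(image)      # on-the-fly augmentation

# num_threads > 1 lets CPU threads keep the queue filled while the GPU trains.
images, labels = tf.train.shuffle_batch(
    [image, features["label"]], batch_size=32,
    capacity=2000, min_after_dequeue=1000, num_threads=4)

# In a session, tf.train.start_queue_runners() must be called before
# fetching `images`/`labels`.
```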

Some specific aspects discussed in detail:

  • (1) Reproducing experiments "with the same data" is kind of straightforward with a dataset generated before training. However, this can be solved (even more elegantly!) by storing a seed for real-time data generation.

  • (2) Training runtime for preprocessing: There are ways to keep an expensive preprocessing pipeline from getting in the way of your actual training. Tensorflow itself recommends filling Queues with many (CPU) threads so that data generation can independently keep up with GPU data consumption (the num_threads argument in the pipeline sketch above). You can read more about this in the input pipeline performance guide.

  • (3) Data augmentation in Tensorflow

    As a side question, could data augmentation even be done in Tensorflow if it was not done during (I think you mean before) training?

    Yes, Tensorflow offers some functionality to do augmentation. In terms of value augmentation of scalars/vectors (or also higher-dimensional data), you can easily build something yourself with tf.multiply or other basic math ops. For image data, there are several ops implemented (see tf.image and tf.contrib.image), which should cover a lot of augmentation needs; a short sketch follows below.

    There are off-the-shelf preprocessing examples on github, one of which is used and described in the CNN tutorial (cifar10).
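    A minimal sketch of such in-graph augmentation, assuming `image` is a float32 tensor of shape [height, width, 3]; the specific ops and parameter values are illustrative choices, not from the tutorial:

    ```python
    # Minimal sketch: in-graph augmentation with tf.image ops (TF 1.x era).
    # `image` is assumed to be a float32 tensor of shape [height, width, 3].
    import tensorflow as tf

    def augment(image):
        image = tf.image.random_flip_left_right(image)
        image = tf.image.random_brightness(image, max_delta=0.2)
        image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
        # Scalar/vector "value augmentation" can be built from basic math ops:
        image = tf.multiply(image, tf.random_uniform([], 0.9, 1.1))
        return image
    ```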


Personally, I would always try to use real-time preprocessing, as generating (potentially huge) datasets feels clunky. But it is perfectly viable: I've seen it done many times and (as you can see above) it definitely has its advantages.

answered by Honeybear