In Tensorflow, it seems that preprocessing could be done on either during training time, when the batch is created from raw images (or data), or when the images are already static. Given that theoretically, the preprocessing should take roughly equal time (if they are done using the same hardware), is there any practical disadvantage in doing data preprocessing (or even data augmentation) before training than during training in real-time?
As a side question, could data augmentation even be done in Tensorflow if was not done during training?
Is there any practical disadvantage in doing data preprocessing (or even data augmentation) before training than during training in real-time?
Yes, there are advantages (+++) and disadvantages (---):
Preprocessing before training:
Preprocessing during training ("real-time"):
Some specific aspects discussed in detail:
(1) Reproducing experiments "with the same data" is kind of straightforward with a dataset generated before training. However this can be solved (even more!) elegantly with storing a seed for real-time data generation.
(2): Training runtime for preprocessing: There are ways to avoid an expensive preprocessing pipeline to get in the way of your actual training. Tensorflow itself recommends filling Queues with many (CPU-)threads so that data generation can independently keep up with GPU data consumption. You can read more about this in the input pipeline performance guide.
(3): Data augmentation in tensorflow
As a side question, could data augmentation even be done in Tensorflow if was not done
during(I think you mean) before training?
Yes, tensorflow offers some functionality to do augmentation. In terms of value augmentation of scalar/vector (or also more dimensional data), you can easily build something yourself with tf.multiply or other basic math ops. For image data, there are several ops implemented (see tf.image and tf.contrib.image), which should cover a lot of augmentation needs.
There are off-the-shelf preprocessing examples on github, one of which is used and described in the CNN tutorial (cifar10).
Personally, I would always try to use real-time preprocessing, as generating (potentially huge) datasets feels clunky. But it is perfectly viable, I've seen it done many times and (as you see above) it definitely has it's advantages.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With