I replaced the CIFAR-10 preprocessing pipeline in my project with a Dataset API approach, and it resulted in a performance decrease of about 10-20%.
Preprocessing is rather standard:
- read image from disk
- make random crop and flip
- shuffle, batch
- feed to the model
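For reference, the pipeline looks roughly like this (a minimal sketch of what is described above; file names, crop sizes, and batch size are placeholders, and in TF 1.3 the Dataset class lives under tf.contrib.data rather than tf.data):

```python
import tensorflow as tf

# Hypothetical list of CIFAR-10 image files on disk; paths/format are placeholders.
filenames = ["cifar10/train_%05d.png" % i for i in range(50000)]

def _parse_and_augment(filename):
    # Read the image from disk and decode it.
    image = tf.image.decode_png(tf.read_file(filename), channels=3)
    # Pad, then apply the usual CIFAR-10 augmentation: random crop and horizontal flip.
    image = tf.image.resize_image_with_crop_or_pad(image, 40, 40)
    image = tf.random_crop(image, [32, 32, 3])
    image = tf.image.random_flip_left_right(image)
    return tf.cast(image, tf.float32) / 255.0

dataset = (tf.data.Dataset.from_tensor_slices(filenames)  # tf.contrib.data.Dataset in TF 1.3
           .map(_parse_and_augment)
           .shuffle(buffer_size=1024)
           .batch(128))

images = dataset.make_one_shot_iterator().get_next()  # fed to the model

with tf.Session() as sess:
    batch = sess.run(images)  # one preprocessed batch, shape (128, 32, 32, 3)
```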
Overall I see that batch processing is now 15% faster, but every once in a while (or, more precisely, whenever I reinitialize the dataset or expect reshuffling) a batch is blocked for a long time (up to 30 sec), which adds up to slower epoch-over-epoch processing.
This behaviour seems to have something to do with internal hashing. If I reduce N in ds.shuffle(buffer_size=N), the delays are shorter but proportionally more frequent. Removing shuffle altogether results in delays as if buffer_size were set to the dataset size.
Can somebody explain the internal logic of the Dataset API when it comes to reading/caching? Is there any reason at all to expect the Dataset API to work faster than manually created queues?
I am using TF 1.3.
Data in the Dataset API is lazily loaded, so it depends on later operations. Right now you load 1024 samples at a time because of the size of the shuffle buffer: it needs to be filled before the first element can be produced. After that, data is loaded lazily, as you fetch values from the iterator.
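You can see this one-time fill cost with a small timing sketch (a toy dataset; the cheap map stands in for the real image reading/augmentation, the exact numbers will differ, and in TF 1.3 the Dataset class lives under tf.contrib.data):

```python
import time
import tensorflow as tf

dataset = tf.data.Dataset.range(100000)      # tf.contrib.data.Dataset in TF 1.3
dataset = dataset.map(lambda x: x + 1)       # stand-in for reading/augmenting images
dataset = dataset.shuffle(buffer_size=1024)  # buffer must be filled before the first element
dataset = dataset.batch(128)

next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    t0 = time.time()
    sess.run(next_batch)   # pays the one-time cost of filling the 1024-element buffer
    print("first batch:  %.3fs" % (time.time() - t0))
    t0 = time.time()
    sess.run(next_batch)   # later batches only top the buffer up, so they come back quickly
    print("second batch: %.3fs" % (time.time() - t0))
```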
The tf.data API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training.
Prefetching overlaps the preprocessing and model execution of a training step. While the model is executing training step s, the input pipeline is reading the data for step s+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the training time and the time it takes to extract the data.
If you implement the same pipeline using the tf.data.Dataset API and using queues, the performance of the Dataset version should be better than the queue-based version.
However, there are a few best practices to observe in order to get the best performance. We have collected these in a performance guide for tf.data. Here are the main issues:
Prefetching is important: the queue-based pipelines prefetch by default and the Dataset pipelines do not. Adding dataset.prefetch(1) to the end of your pipeline will give you most of the benefit of prefetching, but you might need to tune this further.
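For example, a minimal sketch of where the prefetch call would sit, reusing the hypothetical pipeline from the question (names like _parse_and_augment, filenames, and the buffer/batch sizes are placeholders):

```python
dataset = (tf.data.Dataset.from_tensor_slices(filenames)
           .map(_parse_and_augment)   # the preprocessing function sketched in the question
           .shuffle(buffer_size=1024)
           .batch(128)
           .prefetch(1))  # overlap preprocessing of the next batch with training on the current one
```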
The shuffle operator has a delay at the beginning, while it fills its buffer. The queue-based pipelines shuffle a concatenation of all epochs, which means that the buffer is only filled once. In a Dataset pipeline, this would be equivalent to dataset.repeat(NUM_EPOCHS).shuffle(N). By contrast, you can also write dataset.shuffle(N).repeat(NUM_EPOCHS), but this needs to restart the shuffling in each epoch. The latter approach is slightly preferable (and truer to the definition of SGD, for example), but the difference might not be noticeable if your dataset is large.
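In code, the two orderings look like this (a sketch; N and NUM_EPOCHS are placeholders):

```python
# Buffer is filled only once, and shuffling mixes elements across epoch
# boundaries (like the queue-based pipelines):
dataset_a = dataset.repeat(NUM_EPOCHS).shuffle(N)

# Buffer is refilled at the start of every epoch, so each epoch is a proper
# permutation of the data (truer to SGD), at the cost of a per-epoch delay:
dataset_b = dataset.shuffle(N).repeat(NUM_EPOCHS)
```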
We are adding a fused version of shuffle-and-repeat that doesn't incur the delay, and a nightly build of TensorFlow will include the custom tf.contrib.data.shuffle_and_repeat() transformation that is equivalent to dataset.shuffle(N).repeat(NUM_EPOCHS) but doesn't suffer the delay at the start of each epoch.
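If you pick up a build that has it, the fused transformation would be applied roughly like this (a sketch; the exact import path and the availability of Dataset.apply depend on the TF version you use):

```python
# Fused shuffle-and-repeat: equivalent to .shuffle(N).repeat(NUM_EPOCHS)
# but without the buffer-refill delay at the start of each epoch.
dataset = dataset.apply(
    tf.contrib.data.shuffle_and_repeat(buffer_size=N, count=NUM_EPOCHS))
```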
Having said this, if you have a pipeline that is significantly slower when using tf.data than the queues, please file a GitHub issue with the details, and we'll take a look!