In the TensorFlow reading-data tutorial, an example input pipeline is given. In that pipeline the data is shuffled twice: inside the string_input_producer as well as in the shuffle batch generator. Here is the code:
def input_pipeline(filenames, batch_size, num_epochs=None):
    # First shuffle in the input pipeline
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=True)
    example, label = read_my_file_format(filename_queue)
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    # Second shuffle as part of the batching.
    # Requires min_after_dequeue preloaded examples.
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch
Does the second shuffle serve any useful purpose? The shuffle batch generator has the disadvantage that min_after_dequeue examples are always kept preloaded in memory to allow a useful shuffle. My image data is quite heavy in memory consumption, which is why I am considering using a normal batch generator instead. Is there any advantage in shuffling the data twice?
Edit: An additional question: why is the string_input_producer initialized with a default capacity of only 32? Wouldn't it be advantageous to have a multiple of batch_size as the capacity?
Yes, this is a common pattern, and it's shown here in its most general form. The string_input_producer shuffles the order in which the data files are read. Each data file typically contains many examples, for efficiency: reading a million small files is very slow, so it's better to read 1000 large files with 1000 examples each. The examples from the files are therefore read into a shuffling queue, where they are shuffled at a much finer granularity, so that examples from the same file aren't always trained on in the same order, and so that examples mix across the input files.
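The two-level mixing can be illustrated with a small stdlib-only sketch (this is not TensorFlow code; it's a toy simulation of shuffle=True on the filename queue plus a min_after_dequeue-style shuffle buffer, with a hypothetical helper name two_level_shuffle):

```python
import random

def two_level_shuffle(files, buffer_size, seed=0):
    """Toy simulation of the pipeline's two shuffles:
    file-level (string_input_producer with shuffle=True) and
    example-level (shuffle_batch's min_after_dequeue buffer)."""
    rng = random.Random(seed)
    # First shuffle: the order in which whole files are read.
    order = list(files)
    rng.shuffle(order)
    # Stream examples from files in the shuffled order.
    stream = (ex for f in order for ex in f)
    # Second shuffle: a bounded buffer; once it holds more than
    # buffer_size examples, emit a randomly chosen one (roughly
    # how a shuffling queue dequeues).
    buffer, out = [], []
    for ex in stream:
        buffer.append(ex)
        if len(buffer) > buffer_size:
            out.append(buffer.pop(rng.randrange(len(buffer))))
    # Drain the remaining buffered examples at the end.
    while buffer:
        out.append(buffer.pop(rng.randrange(len(buffer))))
    return out

# Three "files" with four examples each.
files = [[f"f{i}_ex{j}" for j in range(4)] for i in range(3)]
mixed = two_level_shuffle(files, buffer_size=4)
```

Note the trade-off the question raises: the finer the example-level mixing you want, the larger the buffer (min_after_dequeue) must be, and every buffered example sits in memory.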
For more details, see Getting good mixing with many input datafiles in tensorflow
If your files each contain only one input example, you don't need to shuffle twice and could get away with only a string_input_producer. Note, however, that you will still likely benefit from a queue that holds a few images after reading, so that you can overlap the input and the training of your network. The queue_runner for a batch or shuffle_batch runs in a separate thread, ensuring that I/O happens in the background and that images are always available for training. And, of course, it's typically good for speed to create minibatches to train on.