Tensorflow Tutorial: Duplicated Shuffling in the Input Pipeline

The Tensorflow reading-data tutorial gives an example input pipeline. In that pipeline the data is shuffled twice: once inside the string_input_producer and again in the shuffle batch generator. Here is the code:

import tensorflow as tf

# read_my_file_format is defined earlier in the tutorial.
def input_pipeline(filenames, batch_size, num_epochs=None):
  # First shuffle in the input pipeline: randomize the file order.
  filename_queue = tf.train.string_input_producer(
      filenames, num_epochs=num_epochs, shuffle=True)

  example, label = read_my_file_format(filename_queue)
  min_after_dequeue = 10000
  capacity = min_after_dequeue + 3 * batch_size
  # Second shuffle as part of the batching;
  # requires min_after_dequeue examples to be pre-loaded.
  example_batch, label_batch = tf.train.shuffle_batch(
      [example, label], batch_size=batch_size, capacity=capacity,
      min_after_dequeue=min_after_dequeue)

  return example_batch, label_batch

Does the second shuffle serve any useful purpose? The shuffle batch generator has the disadvantage that min_after_dequeue examples must always be kept pre-loaded in memory to allow a meaningful shuffle. My image data is quite heavy in memory consumption, which is why I am considering using a normal batch generator instead. Is there any advantage in shuffling the data twice?
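To make the memory cost concrete, here is a back-of-the-envelope calculation for the shuffle buffer alone, assuming hypothetical 224x224 RGB images stored as float32 (the actual image size in the question is not given):

```python
# Memory held by the shuffle buffer, under assumed image dimensions.
min_after_dequeue = 10000
bytes_per_image = 224 * 224 * 3 * 4   # height * width * channels * sizeof(float32)
buffer_bytes = min_after_dequeue * bytes_per_image
print(buffer_bytes / 2**30)           # roughly 5.6 GiB
```

With a plain batch generator the queue only needs to hold `capacity` examples, which can be far smaller.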

Edit: Additional question: why is the string_input_producer initialized with a default capacity of only 32? Wouldn't it be advantageous to use a multiple of batch_size as the capacity?

asked Dec 19 '15 by MarvMind


1 Answer

Yes - this is a common pattern, and it's shown in the most general way. The string_input_producer shuffles the order in which the data files are read. Each data file typically contains many examples, for efficiency. (Reading a million small files is very slow; it's better to read 1000 large files with 1000 examples each.)

Therefore, the examples from the files are read into a shuffling queue, where they are shuffled at a much finer granularity, so that examples from the same file aren't always trained in the same order, and to get mixing across the input files.
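The two-level shuffle described above can be illustrated with a small stdlib-only simulation (the function and data names are illustrative, not TensorFlow APIs): file order is shuffled first, then a bounded buffer, playing the role of min_after_dequeue, shuffles the resulting example stream at a finer granularity.

```python
import random

def two_stage_shuffle(files, buffer_size, seed=0):
    """Simulate string_input_producer + shuffle_batch's buffer.

    files: dict mapping filename -> list of examples.
    """
    rng = random.Random(seed)
    # Stage 1: shuffle the order in which files are read.
    order = list(files)
    rng.shuffle(order)
    # The reader then produces examples file by file, in that order.
    stream = (ex for f in order for ex in files[f])
    # Stage 2: a bounded shuffle buffer, like min_after_dequeue.
    buffer, out = [], []
    for ex in stream:
        buffer.append(ex)
        if len(buffer) > buffer_size:
            out.append(buffer.pop(rng.randrange(len(buffer))))
    while buffer:
        out.append(buffer.pop(rng.randrange(len(buffer))))
    return out

files = {f"file{i}": [f"f{i}-ex{j}" for j in range(100)] for i in range(3)}
mixed = two_stage_shuffle(files, buffer_size=50)
# Every example still appears exactly once:
assert sorted(mixed) == sorted(ex for exs in files.values() for ex in exs)
```

Note that with only the file-level shuffle (stage 1), all examples from one file would still arrive contiguously; the buffer in stage 2 is what interleaves them.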

For more details, see Getting good mixing with many input datafiles in tensorflow.

If each of your files contains only one input example, you don't need to shuffle twice and could get away with just a string_input_producer. Note, however, that you will still likely benefit from a queue that holds a few images after reading, so that input and training of your network overlap. The queue_runner for batch or shuffle_batch runs in a separate thread, ensuring that I/O happens in the background and that images are always available for training. And, of course, it is usually good for speed to create minibatches to train on.
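The producer/consumer overlap that the queue_runner provides can be sketched with the stdlib, independent of TensorFlow (names and timings here are purely illustrative): a loader thread fills a bounded queue while the training loop consumes from it, so reads and training steps happen concurrently.

```python
import queue
import threading
import time

def loader(q, n_batches):
    """Producer thread: stands in for a queue runner doing file I/O."""
    for i in range(n_batches):
        time.sleep(0.01)          # pretend this is a slow disk read
        q.put(f"batch-{i}")
    q.put(None)                   # sentinel: no more data

q = queue.Queue(maxsize=4)        # small prefetch buffer, like `capacity`
threading.Thread(target=loader, args=(q, 10), daemon=True).start()

trained = []
while True:
    batch = q.get()               # training thread blocks only when empty
    if batch is None:
        break
    trained.append(batch)         # pretend this is a training step

print(len(trained))
```

Because the queue is FIFO and bounded, the trainer never waits for more than one read once the buffer is warm, which is exactly the overlap the answer describes.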

answered Oct 18 '22 by dga