
Regarding the use of tf.train.shuffle_batch() to create batches

Tags:

tensorflow

In the TensorFlow tutorial, the following example is given for tf.train.shuffle_batch():

# Creates batches of 32 images and 32 labels.
image_batch, label_batch = tf.train.shuffle_batch(
     [single_image, single_label],
     batch_size=32,
     num_threads=4,
     capacity=50000,
     min_after_dequeue=10000)

I am not clear about the meaning of capacity and min_after_dequeue. In this example, they are set to 50000 and 10000 respectively. What is the logic behind this kind of setup, or what do these values mean? If the input has 200 images and 200 labels, what will happen?

asked Sep 02 '16 by user288609

1 Answer

The tf.train.shuffle_batch() function uses a tf.RandomShuffleQueue internally to accumulate batches of batch_size elements, which are sampled uniformly at random from the elements currently in the queue.

Many training algorithms, such as the stochastic gradient descent–based algorithms that TensorFlow uses to optimize neural networks, rely on sampling records uniformly at random from the entire training set. However, it is not always practical to load the entire training set in memory (in order to sample from it), so tf.train.shuffle_batch() offers a compromise: it fills an internal buffer with between min_after_dequeue and capacity elements, and samples uniformly at random from that buffer. For many training processes, this improves the accuracy of the model and provides adequate randomization.
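To illustrate what these two arguments control, here is a minimal sketch (my own illustration, not the tutorial's code) that uses a tf.RandomShuffleQueue directly; tf.train.shuffle_batch() builds essentially this kind of buffer for you:

import tensorflow as tf

# A toy buffer: it holds at most `capacity` elements, and a dequeue will not
# leave fewer than `min_after_dequeue` elements behind, so every sample is
# drawn at random from a reasonably full buffer.
queue = tf.RandomShuffleQueue(capacity=10, min_after_dequeue=4,
                              dtypes=[tf.int32], shapes=[[]])

enqueue_op = queue.enqueue_many([tf.range(10)])   # fill the buffer with 0..9
sample = queue.dequeue_many(3)                    # draw 3 elements at random

with tf.Session() as sess:
    sess.run(enqueue_op)
    print(sess.run(sample))   # e.g. [7 2 5] -- a random subset of the buffer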

The min_after_dequeue and capacity arguments have an indirect effect on training performance. Setting a large min_after_dequeue value will delay the start of training, because TensorFlow has to process at least that many elements before training can start. The capacity is an upper bound on the amount of memory that the input pipeline will consume: setting this too large may cause the training process to run out of memory (and possibly start swapping, which will impair the training throughput).
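To make the memory bound concrete, here is a rough back-of-the-envelope estimate (the 224x224x3 float32 image size is an assumed example, not something from the question):

capacity = 50000
bytes_per_image = 224 * 224 * 3 * 4        # one float32 image
print(capacity * bytes_per_image / 1e9)    # ~30.1 GB just for the buffer

So a capacity of 50000 only makes sense for small records; for large images you would typically choose a much smaller value, or the buffer alone could exhaust memory.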

If the dataset has only 200 images, it would be easy to load the entire dataset into memory. tf.train.shuffle_batch() would be quite inefficient in this case, because it enqueues each image and label multiple times in the tf.RandomShuffleQueue. You may find it more efficient to do the following instead, using tf.train.slice_input_producer() and tf.train.batch():

random_image, random_label = tf.train.slice_input_producer([all_images, all_labels],
                                                           shuffle=True)

image_batch, label_batch = tf.train.batch([random_image, random_label],
                                          batch_size=32)
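
For completeness, here is a minimal sketch of how either pipeline would be driven in a session (assuming all_images and all_labels are in-memory tensors or arrays holding the full 200-element dataset, as in the scenario above); both tf.train.slice_input_producer() and tf.train.shuffle_batch() register queue runners that must be started:

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    images, labels = sess.run([image_batch, label_batch])  # one batch of 32
    coord.request_stop()
    coord.join(threads)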
answered Nov 16 '22 by mrry