
In TensorFlow, when I use dataset.shuffle(1000), am I only using 1000 samples from my whole dataset?

When using the following code to train my network:

classifier = tf.estimator.Estimator(
    model_fn=my_neural_network_model, 
    model_dir=some_path_to_save_checkpoints,
    params={
        some_parameters
    }
)
classifier.train(input_fn=data_train_estimator, steps=step_num)

where data_train_estimator is defined as:

def data_train_estimator():
    # Read the CSV file line by line and parse each line into (features, label).
    dataset = tf.data.TextLineDataset(train_csv_file).map(_parse_csv_train)
    dataset = dataset.batch(100)     # group parsed examples into batches of 100
    dataset = dataset.shuffle(1000)  # shuffle with a buffer of 1000 dataset elements (batches here, since batch() comes first)
    dataset = dataset.repeat()       # repeat indefinitely
    iterator = dataset.make_one_shot_iterator()
    feature, label = iterator.get_next()
    return feature, label

How does dataset.shuffle(1000) actually work?

More specifically,

Let's say I have 20000 images, batch size = 100, shuffle buffer size = 1000, and I train the model for 5000 steps.

1. For every 1000 steps, am I using 10 batches (of size 100), each independently taken from the same 1000 images in the shuffle buffer?

2.1 Does the shuffle buffer work like a moving window?

2.2 Or, does it randomly pick 1000 out of the 5000 images (with or without replacement)?

3. In the whole 5000 steps, how many different states has the shuffle buffer been in?

asked Sep 11 '18 by user10253771


1 Answer

With shuffle(1000) you keep a buffer of 1000 points in memory. When a data point is needed during training, it is drawn uniformly at random from the 1000 points currently in the buffer. That leaves 999 points in the buffer, and point 1001 from the input stream is added to fill the empty slot. The next point is then drawn from the refilled buffer, and so on.
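
As an illustration only (this is not TensorFlow's actual implementation), a small pure-Python generator that mimics this buffer-and-replace behaviour might look like this:

import random

def simulated_shuffle(stream, buffer_size=1000, seed=None):
    # Illustrative stand-in for dataset.shuffle(buffer_size): keep a buffer,
    # emit a random element from it, and refill the slot from the input stream.
    rng = random.Random(seed)
    it = iter(stream)
    buffer = []
    for item in it:                      # fill the buffer with the first buffer_size elements
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    for item in it:                      # draw one element at random, then reuse its slot
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    rng.shuffle(buffer)                  # input exhausted: drain what is left in the buffer
    yield from buffer

# The first element emitted always comes from the first 1000 inputs:
print(next(simulated_shuffle(range(20000), buffer_size=1000, seed=0)))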

To answer your questions in point form:

For every 1000 steps, am I using 10 batches(of size 100), each independently taken from the same 1000 images in the shuffle buffer?

No, the buffer stays the same size, but every image drawn from it is replaced with an image that has not yet been used in that epoch.

Does the shuffle buffer work like a moving window? Or, does it randomly pick 1000 out of the 5000 images (with or without replacement)?

It draws without replacement, and it doesn't really work like a moving window: each drawn image is immediately replaced in the buffer by the next element coming from the input pipeline.
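
You can see this behaviour with a small dataset; a minimal sketch, assuming TF 2.x eager execution (the question's code is TF 1.x, but shuffle behaves the same way):

import tensorflow as tf  # assuming TF 2.x so the dataset can be iterated eagerly

ds = tf.data.Dataset.range(10).shuffle(buffer_size=3, seed=42)
print([int(x) for x in ds])
# The first element is always one of 0, 1 or 2, because the buffer initially
# holds only the first 3 elements of the stream; later elements only become
# candidates once earlier ones have been drawn and their slots refilled.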

In the whole 5000 steps, how many different states has the shuffle buffer been in?

Roughly one new state per sample drawn: each draw removes one element from the buffer and the next element from the input pipeline takes its place. With 5000 steps and a batch size of 100, that is about 5000 × 100 = 500,000 buffer states. A few states might repeat by chance, but that is unlikely.
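
As a back-of-the-envelope check (assuming one buffer update per sample drawn, as described above):

steps = 5000                          # training steps
batch_size = 100                      # samples drawn per step
samples_drawn = steps * batch_size    # 500,000 samples pulled from the buffer
buffer_states = samples_drawn + 1     # each draw swaps one element, giving one new state
print(buffer_states)                  # 500001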

You might also find this question useful.

answered Oct 09 '22 by user2653663