When using the following code to train my network:
classifier = tf.estimator.Estimator(
    model_fn=my_neural_network_model,
    model_dir=some_path_to_save_checkpoints,
    params={
        some_parameters
    }
)
classifier.train(input_fn=data_train_estimator, steps=step_num)
where data_train_estimator is defined as:
def data_train_estimator():
    dataset = tf.data.TextLineDataset(train_csv_file).map(_parse_csv_train)
    dataset = dataset.batch(100)
    dataset = dataset.shuffle(1000)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    feature, label = iterator.get_next()
    return feature, label
How does dataset.shuffle(1000) actually work?
More specifically,
Let's say I have 20000 images, batch size = 100, shuffle buffer size = 1000, and I train the model for 5000 steps.
1. For every 1000 steps, am I using 10 batches (of size 100), each independently taken from the same 1000 images in the shuffle buffer?
2.1 Does the shuffle buffer work like a moving window?
2.2 Or, does it randomly pick 1000 out of the 20000 images (with or without replacement)?
3. In the whole 5000 steps, how many different states has the shuffle buffer been in?
shuffle(buffer_size, seed=None, reshuffle_each_iteration=None) randomly shuffles the elements of the dataset. buffer_size is the number of elements that are held in an in-memory buffer and sampled from; the shuffled data is returned as a new tf.data.Dataset.
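As a concrete illustration, here is a minimal sketch in the TF 1.x style used in the question (the toy dataset and the seed value are only for illustration):

import tensorflow as tf

dataset = tf.data.Dataset.range(10)
# buffer_size=5: only 5 elements are held in memory and sampled from at a time.
# seed makes the order reproducible; reshuffle_each_iteration=True (the default)
# re-shuffles on every pass when the dataset is repeated.
dataset = dataset.shuffle(buffer_size=5, seed=42, reshuffle_each_iteration=True)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    print([sess.run(next_element) for _ in range(10)])  # e.g. [3, 0, 5, 1, ...]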
TensorFlow recommends serializing and storing datasets (CSVs, images, text, etc.) as a set of TFRecord files, each with a maximum size of around 100-200 MB.
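For reference, a minimal sketch of writing such a TFRecord file (TF 1.x API; the output file name and the (path, label) pairs are hypothetical):

import tensorflow as tf

def _to_example(image_bytes, label):
    # Wrap one (image, label) pair as a tf.train.Example protocol buffer.
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

with tf.python_io.TFRecordWriter("train-00000.tfrecord") as writer:
    for path, label in [("img_0.jpg", 0), ("img_1.jpg", 1)]:  # hypothetical data
        with open(path, "rb") as f:
            writer.write(_to_example(f.read(), label).SerializeToString())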
For perfect shuffling, set the buffer size equal to the full size of the dataset. For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer.
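The effect is easy to observe with a small sketch (TF 1.x style; the 10,000-element range stands in for a real dataset):

import tensorflow as tf

def first_drawn_element(buffer_size):
    dataset = tf.data.Dataset.range(10000).shuffle(buffer_size)
    next_element = dataset.make_one_shot_iterator().get_next()
    with tf.Session() as sess:
        return sess.run(next_element)

print(first_drawn_element(1000))   # always < 1000: sampled from the initial buffer only
print(first_drawn_element(10000))  # buffer covers the whole dataset, so any element is possible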
The tf.data API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training.
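Such a pipeline might look like the following sketch (the file pattern, image size and augmentation are illustrative assumptions, not taken from the question):

import tensorflow as tf

def _load_and_augment(path):
    image = tf.image.decode_jpeg(tf.read_file(path), channels=3)
    image = tf.image.resize_images(image, [224, 224])
    image = tf.image.random_flip_left_right(image)  # random perturbation
    return image

dataset = (tf.data.Dataset.list_files("/data/images/*.jpg")  # files on shared storage
           .map(_load_and_augment)
           .shuffle(1000)
           .batch(100)
           .repeat())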
With buffer_size=1000
you keep a buffer of 1000 points in memory. Whenever a data point is needed during training, it is drawn at random from those 1000 points. That leaves 999 points in the buffer, and point 1001 is added to refill it. The next point is then drawn from the updated buffer.
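The algorithm can be sketched in plain Python (this is only an illustration of the behaviour described above, not TensorFlow's actual implementation):

import random
from itertools import islice

def buffered_shuffle(stream, buffer_size):
    stream = iter(stream)
    buffer = list(islice(stream, buffer_size))  # fill with points 1..buffer_size
    for incoming in stream:                     # point buffer_size+1, buffer_size+2, ...
        i = random.randrange(len(buffer))
        yield buffer[i]                         # draw one point at random
        buffer[i] = incoming                    # refill the slot it left behind
    random.shuffle(buffer)                      # drain whatever is left at the end
    yield from buffer

print(list(buffered_shuffle(range(10), buffer_size=4)))  # e.g. [2, 0, 5, 1, 6, ...]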
To answer you in point form:
For every 1000 steps, am I using 10 batches (of size 100), each independently taken from the same 1000 images in the shuffle buffer?
No. The buffer stays at a constant size of 1000 images, but every image drawn from it is replaced with an image that has not yet been used in that epoch.
Does the shuffle buffer work like a moving window? Or, does it randomly pick 1000 out of the 20000 images (with or without replacement)?
It draws without replacement within an epoch, and it does not really work like a moving window, since drawn images are replaced dynamically with new ones from the stream.
In the whole 5000 steps, how many different states has the shuffle buffer been in?
The buffer changes state every time a drawn image is replaced, and each of the 5000 steps consumes a batch of 100 images, so it passes through roughly batch_size * n_steps = 500,000 states. A few of those states might coincide by chance, but that is very unlikely.
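To see where that figure comes from, here is a quick back-of-the-envelope sketch using the question's numbers:

# Each drawn image is replaced in the buffer, changing its contents once per draw.
n_images, batch_size, n_steps, buffer_size = 20000, 100, 5000, 1000

elements_drawn = batch_size * n_steps  # 500,000 images consumed in total
epochs = elements_drawn / n_images     # 25 passes over the data (hence the repeat())
buffer_states = elements_drawn         # roughly one new buffer state per drawn image
print(epochs, buffer_states)           # 25.0 500000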
You might also find this question useful.