How can I shuffle a whole dataset with TensorFlow?

Tags: shuffle, tensorflow

I currently use the following function for shuffling:

from tensorflow.contrib import data
def input_pipeline(filenames, batch_size):
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
    dataset = data.TextLineDataset(filenames)
    dataset = dataset.map(decode_func)
    dataset = dataset.shuffle(buffer_size=10000)  # Equivalent to min_after_dequeue=10000.
    dataset = dataset.batch(batch_size)

    # Return an *initializable* iterator over the dataset, which will allow us to
    # re-initialize it at the beginning of each epoch.
    return dataset.make_initializable_iterator()

But this only shuffles the data within a window of buffer_size elements, and the buffer is filled in the order the records are read.
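The limitation can be illustrated with a small pure-Python model of a shuffle buffer. This is only a sketch of the idea, not TensorFlow's actual implementation; `buffered_shuffle` is a hypothetical helper and it assumes `buffer_size >= 1`.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        # Fill the buffer in input order first.
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


out = list(buffered_shuffle(range(1000), buffer_size=10, seed=0))
# The k-th output can only come from the first k + buffer_size inputs,
# so the early outputs are all small numbers rather than a uniform sample.
print(out[:10])
```

In this model a record near the end of the input can never appear near the start of the output unless the buffer is about as large as the dataset.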

My dataset is enormous, so I cannot make buffer_size very large. Is there another way to shuffle the whole dataset?


asked Jun 28 '17 02:06

danche

1 Answer

Currently, the Dataset API has no built-in support for shuffling an entire dataset that is larger than the shuffle buffer (here, more than 10k examples). According to this thread, the common approach is:

1. Randomly shuffle the entire data once using a MapReduce/Spark/Beam/etc. job to create a set of roughly equal-sized files ("shards").

2. In each epoch:

a. Randomly shuffle the list of shard filenames, using Dataset.list_files(...).shuffle(num_shards).

b. Use dataset.interleave(lambda filename: tf.data.TextLineDataset(filename), cycle_length=N) to mix together records from N different shards.

c. Use dataset.shuffle(B) to shuffle the resulting dataset. Setting B might require some experimentation, but you will probably want to set it to some value larger than the number of records in a single shard.
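The per-epoch steps above could be sketched roughly as follows. This assumes TensorFlow 2.x with eager execution and shards that have already been written as text files. The function name `sharded_shuffle_dataset`, the toy shard files, and the chosen sizes are illustrative assumptions, not part of the original thread.

```python
import os
import tempfile

import tensorflow as tf


def sharded_shuffle_dataset(file_pattern, num_shards, cycle_length, shuffle_buffer):
    """Build a dataset that reshuffles shards and records on every pass."""
    # a. List the shard files, then shuffle the filenames explicitly.
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    files = files.shuffle(num_shards)
    # b. Read `cycle_length` shards at a time and interleave their lines.
    lines = files.interleave(
        lambda filename: tf.data.TextLineDataset(filename),
        cycle_length=cycle_length,
    )
    # c. Shuffle the interleaved records with a bounded buffer.
    return lines.shuffle(shuffle_buffer)


# Toy stand-in for the pre-shuffled shards: 4 files with 5 lines each.
shard_dir = tempfile.mkdtemp()
for shard in range(4):
    path = os.path.join(shard_dir, "shard-%d.txt" % shard)
    with open(path, "w") as f:
        for i in range(5):
            f.write("%d\n" % (shard * 5 + i))

dataset = sharded_shuffle_dataset(
    os.path.join(shard_dir, "shard-*.txt"),
    num_shards=4,
    cycle_length=2,
    shuffle_buffer=8,  # larger than the 5 records in one shard
)

# Each pass over the dataset is one epoch and is reshuffled by default.
for epoch in range(2):
    records = [int(x.decode()) for x in dataset.as_numpy_iterator()]
    print(epoch, records)
```

With real data you would typically add `.map(decode_func)` and `.batch(batch_size)` after the final shuffle, as in the question's pipeline.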


answered Sep 21 '22 09:09

zohar.kom
