I currently use the following function for shuffling:
    from tensorflow.contrib import data

    def input_pipeline(filenames, batch_size):
        # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
        dataset = data.TextLineDataset(filenames)
        dataset = dataset.map(decode_func)
        dataset = dataset.shuffle(buffer_size=10000)  # Equivalent to min_after_dequeue=10000.
        dataset = dataset.batch(batch_size)
        # Return an *initializable* iterator over the dataset, which allows us to
        # re-initialize it at the beginning of each epoch.
        return dataset.make_initializable_iterator()
But this only shuffles within a window of buffer_size elements, and the buffer is filled in file order. My data is enormous, so I cannot set buffer_size large enough to cover it. Is there any other solution for shuffling the whole dataset?
For perfect shuffling, the buffer size would need to equal the full size of the dataset: if your dataset contains 10,000 elements but buffer_size is set to 1,000, shuffle initially selects a random element from only the first 1,000 elements in the buffer.
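As a minimal sketch of what that means in code (only viable when the whole dataset fits in memory; DATASET_SIZE is a hypothetical constant holding the total record count):

    import tensorflow as tf

    DATASET_SIZE = 10000  # hypothetical: total number of records in the dataset

    dataset = tf.data.TextLineDataset(filenames)
    # A buffer as large as the dataset gives a uniform (perfect) shuffle,
    # at the cost of holding every element in memory at once.
    dataset = dataset.shuffle(buffer_size=DATASET_SIZE)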
Currently there is no support in the Dataset API for shuffling a whole dataset that is larger than the shuffle buffer. According to this thread, the common approach (sketched in code after the steps below) is:
- Randomly shuffle the entire dataset once using a MapReduce/Spark/Beam/etc. job to create a set of roughly equal-sized files ("shards").
In each epoch:
a. Randomly shuffle the list of shard filenames, using Dataset.list_files(...).shuffle(num_shards).
b. Use dataset.interleave(lambda filename: tf.data.TextLineDataset(filename), cycle_length=N) to mix together records from N different shards.
c. Use dataset.shuffle(B) to shuffle the resulting dataset. Setting B might require some experimentation, but you will probably want to set it to some value larger than the number of records in a single shard.
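Putting steps a–c together, a sketch of the per-epoch pipeline might look like the following. The shard pattern, NUM_SHARDS, CYCLE_LENGTH (N), and SHUFFLE_BUFFER (B) are all placeholder values you would tune for your data, sharded_input_pipeline is a hypothetical name, and decode_func is the decoder from the question:

    import tensorflow as tf

    NUM_SHARDS = 100         # hypothetical number of pre-shuffled shard files
    CYCLE_LENGTH = 8         # N: how many shards to read from concurrently
    SHUFFLE_BUFFER = 50000   # B: larger than the record count of a single shard

    def sharded_input_pipeline(batch_size):
        # a. Randomly shuffle the list of shard filenames.
        filenames = tf.data.Dataset.list_files("/path/to/shards/train-*.txt")
        filenames = filenames.shuffle(NUM_SHARDS)

        # b. Mix together records from CYCLE_LENGTH different shards.
        dataset = filenames.interleave(
            lambda filename: tf.data.TextLineDataset(filename),
            cycle_length=CYCLE_LENGTH)

        # c. Shuffle the interleaved records with a buffer larger than one shard.
        dataset = dataset.shuffle(buffer_size=SHUFFLE_BUFFER)

        dataset = dataset.map(decode_func)
        dataset = dataset.batch(batch_size)
        return dataset.make_initializable_iterator()

Because the shards were already shuffled once offline, combining shard-level filename shuffling, interleaving, and a within-buffer shuffle approximates a global shuffle without ever holding the full dataset in memory.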