How can I shuffle a whole dataset with TensorFlow?

Currently I use the following function for shuffling:

from tensorflow.contrib import data
def input_pipeline(filenames, batch_size):
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
    dataset = data.TextLineDataset(filenames)
    dataset = dataset.map(decode_func)  # decode_func (not shown) parses each text line.
    dataset = dataset.shuffle(buffer_size=10000)  # Equivalent to min_after_dequeue=10000.
    dataset = dataset.batch(batch_size)

    # Return an *initializable* iterator over the dataset, which will allow us to
    # re-initialize it at the beginning of each epoch.
    return dataset.make_initializable_iterator() 

But this only shuffles within a window of buffer_size elements, and the buffer is filled in the order the records are read.

My dataset is enormous, so I cannot make buffer_size very large. Is there another way to shuffle the whole dataset?

danche asked Jun 28 '17 02:06



1 Answer

Currently the Dataset API has no support for shuffling a whole dataset that is larger than the shuffle buffer (10k examples in the question). According to this thread, the common approach is:

  1. Randomly shuffle the entire data once using a MapReduce/Spark/Beam/etc. job to create a set of roughly equal-sized files ("shards").
  2. In each epoch:

    a. Randomly shuffle the list of shard filenames, using Dataset.list_files(...).shuffle(num_shards).

    b. Use dataset.interleave(lambda filename: tf.data.TextLineDataset(filename), cycle_length=N) to mix together records from N different shards.

    c. Use dataset.shuffle(B) to shuffle the resulting dataset. Setting B might require some experimentation, but you will probably want to set it to some value larger than the number of records in a single shard.
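
Step 1 is the part tf.data does not do for you. For data that fits on one machine's disk but not in memory, it can be sketched in plain Python without a MapReduce/Spark/Beam job. This is only an illustration under assumptions: the function name write_shuffled_shards, the shard-NNNNN.txt naming, and the one-record-per-line text format are all hypothetical choices, not anything from the thread. It scatters each record to a randomly chosen shard in one streaming pass, then shuffles each shard in memory.

```python
import os
import random


def write_shuffled_shards(records, out_dir, num_shards, seed=None):
    """Write `records` (an iterable of single-line strings) into
    `num_shards` text files in random order. Hypothetical helper.

    Returns the list of shard file paths.
    """
    rng = random.Random(seed)
    paths = [
        os.path.join(out_dir, "shard-%05d.txt" % i) for i in range(num_shards)
    ]

    # Pass 1: stream over the input once and append each record to a
    # randomly chosen shard, so shards end up roughly equal in size.
    # All shard files are open at once, so keep num_shards modest.
    handles = [open(p, "w") for p in paths]
    try:
        for record in records:
            handles[rng.randrange(num_shards)].write(record + "\n")
    finally:
        for handle in handles:
            handle.close()

    # Pass 2: each shard is assumed small enough to load into memory,
    # so shuffle its lines and rewrite the file in place.
    for path in paths:
        with open(path) as f:
            lines = f.readlines()
        rng.shuffle(lines)
        with open(path, "w") as f:
            f.writelines(lines)

    return paths
```

With shards written this way, a glob pattern such as out_dir + "/shard-*.txt" is what step 2a would pass to Dataset.list_files(...). Because every record lands in a random shard and every shard is shuffled internally, the shuffle buffer in step 2c only has to mix records that are already in random order.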

zohar.kom answered Sep 21 '22 09:09