
Why is TensorFlow's tf.data.Dataset.shuffle so slow?

Tags:

tensorflow

The shuffle step in the following code runs very slowly for a moderate buffer_size (say 1000):

import tensorflow as tf

filenames = tf.constant(filenames)
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)   # parse each file into features
dataset = dataset.batch(batch_size)
dataset = dataset.shuffle(buffer_size)   # this is the slow step

If we use numpy to shuffle the data, the code looks as follows:

import numpy as np

idx = np.arange(len(filenames))
np.random.shuffle(idx)                   # shuffle indices, not the data itself
new_filenames = [filenames[i] for i in idx]
next_batch_filenames = new_filenames[:batch_size]
# get the corresponding files in the batch

This is much faster. I wonder whether TF does something beyond simply shuffling the data.

asked Jan 13 '18 by user131379

People also ask

What is the shuffle buffer size in TensorFlow?

For perfect shuffling, set the buffer size equal to the full size of the dataset. For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer.
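
A minimal sketch contrasting the two settings described above, using a toy Dataset.range pipeline (not the asker's data):

import tensorflow as tf

dataset = tf.data.Dataset.range(10_000)
# Perfect (uniform) shuffle: the buffer holds every element.
dataset = dataset.shuffle(buffer_size=10_000)
# Partial shuffle: the first draw comes from only the first 1,000
# elements; the buffer is refilled as elements are consumed.
# dataset = dataset.shuffle(buffer_size=1_000)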

Does tf.data use the GPU?

TensorFlow code and tf.keras models will transparently run on a single GPU with no code changes required.

What does shuffle do in TensorFlow?

The shuffle() method randomly shuffles a tensor along its first dimension.
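
That one-line description matches tf.random.shuffle; a minimal sketch with a toy tensor (not from the question):

import tensorflow as tf

# Rows (the first dimension) are permuted; the contents of each row stay intact.
t = tf.constant([[1, 2], [3, 4], [5, 6]])
print(tf.random.shuffle(t))  # e.g. [[5 6] [1 2] [3 4]]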

What does tf.data.experimental.AUTOTUNE do?

You can set it to tf.data.AUTOTUNE, which will prompt the tf.data runtime to tune the value dynamically at runtime. Note that the prefetch transformation provides benefits any time there is an opportunity to overlap the work of a "producer" with the work of a "consumer."
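
A minimal sketch of a pipeline using AUTOTUNE (the map function and batch size are placeholders for illustration; on older TF versions the constant lives at tf.data.experimental.AUTOTUNE):

import tensorflow as tf

dataset = tf.data.Dataset.range(10_000)
dataset = dataset.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(32)
# prefetch overlaps producer (pipeline) work with consumer (training) work;
# AUTOTUNE lets the tf.data runtime pick the buffer size dynamically.
dataset = dataset.prefetch(tf.data.AUTOTUNE)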


1 Answer

As Anton Codes wrote, your first snippet shuffles batches of whatever _parse_function parses from your files (probably feature data), while your second snippet only shuffles filenames.

If shuffling on file level is sufficient, you can actually achieve (roughly) the same performance via the tf.data.Dataset API:

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames))  # shuffle file names, not parsed data
dataset = dataset.map(_parse_function)     # parse only after shuffling
dataset = dataset.batch(batch_size)

This practice of shuffling "pointers" to your training samples instead of the samples themselves can often improve performance.

NumPy might still be a little more efficient, though, because of the overhead of shuffling inside the computational graph; that is what tf.data.Dataset.shuffle does, and there is in fact a dedicated C++ kernel for this operation.
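
If that overhead matters, one option is to shuffle outside the graph and hand pre-shuffled arrays to tf.data. A rough sketch, assuming filenames and labels are Python lists or arrays as in the question (note this fixes a single order, giving up the per-epoch reshuffling mentioned below):

import numpy as np
import tensorflow as tf

# Shuffle once in NumPy, outside the graph; no in-graph shuffle op is needed.
idx = np.random.permutation(len(filenames))
dataset = tf.data.Dataset.from_tensor_slices(
    (np.asarray(filenames)[idx], np.asarray(labels)[idx]))
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)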

The advantage of the tf.data.Dataset approach is that it can automatically reshuffle the Dataset after each epoch.
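
That behavior is controlled by the reshuffle_each_iteration argument, which defaults to True. A short sketch (num_epochs is a placeholder, not from the thread):

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
# A fresh permutation is drawn every time the dataset is iterated,
# e.g. at the start of each epoch.
dataset = dataset.shuffle(len(filenames), reshuffle_each_iteration=True)
dataset = dataset.repeat(num_epochs)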

answered Oct 17 '22 by Ernesto Elsäßer