The shuffle step in the following code runs very slowly for a moderate buffer_size (say 1000):
import tensorflow as tf

filenames = tf.constant(filenames)
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
dataset = dataset.shuffle(buffer_size)
If we use numpy
to shuffle the data, the code looks as follows:
import numpy as np

idx = np.arange(len(filenames))
np.random.shuffle(idx)
new_filenames = [filenames[i] for i in idx]
next_batch_filenames = new_filenames[:batch_size]
# then load the corresponding files for this batch
This is much faster. I wonder whether TF does something beyond simply shuffling the data.
For perfect shuffling, set the buffer size equal to the full size of the dataset. For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer.
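A minimal sketch of this effect, using a hypothetical 10,000-element dataset:

import tensorflow as tf

ds = tf.data.Dataset.range(10_000)
# Buffer as large as the dataset: a uniform ("perfect") shuffle.
full_shuffle = ds.shuffle(buffer_size=10_000)
# Small buffer: early elements are drawn only from the first 1,000 items.
local_shuffle = ds.shuffle(buffer_size=1_000)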
As Anton Codes wrote, your first snippet shuffles batches of whatever _parse_function
parses from your files (probably feature data), while your second snippet only shuffles filenames.
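A small toy example (a sketch; the data is made up) shows that shuffling after batch permutes whole batches while each batch's contents stay intact:

import tensorflow as tf

ds = tf.data.Dataset.range(8).batch(2).shuffle(buffer_size=4)
for batch in ds:
    print(batch.numpy())  # pairs such as [4 5], [0 1], [6 7], [2 3] stay together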
If shuffling at the file level is sufficient, you can actually achieve (roughly) the same performance via the tf.data.Dataset API:
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames)) # shuffle file names
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
This practice of shuffling "pointers" to your training samples instead of the samples themselves can often improve performance.
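In plain NumPy, the same pointer-shuffling idea looks as follows (a sketch assuming filenames and labels are the Python lists from the question; applying one permutation to both keeps them aligned):

import numpy as np

idx = np.random.permutation(len(filenames))       # shuffled "pointers"
shuffled_filenames = [filenames[i] for i in idx]
shuffled_labels = [labels[i] for i in idx]        # same permutation keeps pairs aligned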
NumPy might still be slightly more efficient, though, due to the overhead of shuffling inside the computational graph (which tf.data.Dataset.shuffle does; there is actually a dedicated C++ kernel for this operation).
The advantage of the tf.data.Dataset
approach is that it can automatically reshuffle the Dataset after each epoch.
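For example (a sketch reusing the question's variables; num_epochs is assumed), shuffle reshuffles on every pass by default through its reshuffle_each_iteration argument:

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames), reshuffle_each_iteration=True)  # True is the default
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)

for epoch in range(num_epochs):
    for batch in dataset:  # a different shuffle order on each epoch
        ...                # train step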