The shuffle step in the following code runs very slowly for a moderate buffer_size (say 1000):
import tensorflow as tf

filenames = tf.constant(filenames)
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
dataset = dataset.shuffle(buffer_size)
If we use numpy
to shuffle the data, the code looks as follows:
import numpy as np

idx = np.arange(len(filenames))
np.random.shuffle(idx)
new_filenames = [filenames[i] for i in idx]
next_batch_filenames = new_filenames[:batch_size]
# then load the corresponding files for this batch
This is much faster. I wonder whether TF does something beyond simply shuffling the data.
For perfect shuffling, set the buffer size equal to the full size of the dataset. For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer.
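A minimal sketch of this effect, using a hypothetical 10,000-element dataset:

import tensorflow as tf

ds = tf.data.Dataset.range(10_000)
# Buffer as large as the dataset: a uniform ("perfect") shuffle.
full_shuffle = ds.shuffle(buffer_size=10_000)
# Small buffer: early elements are drawn only from the first 1,000 items.
local_shuffle = ds.shuffle(buffer_size=1_000)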
As Anton Codes wrote, your first snippet shuffles batches of whatever _parse_function
parses from your files (probably feature data), while your second snippet only shuffles filenames.
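A small toy example (a sketch; the data is made up) shows that shuffling after batch permutes whole batches while each batch's contents stay intact:

import tensorflow as tf

ds = tf.data.Dataset.range(8).batch(2).shuffle(buffer_size=4)
for batch in ds:
    print(batch.numpy())  # pairs such as [4 5], [0 1], [6 7], [2 3] stay together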
If shuffling at the file level is sufficient, you can actually achieve (roughly) the same performance via the tf.data.Dataset API:
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames)) # shuffle file names
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
This practice of shuffling "pointers" to your training samples instead of the samples themselves can often improve performance.
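In plain NumPy, the same pointer-shuffling idea looks as follows (a sketch assuming filenames and labels are the Python lists from the question; applying one permutation to both keeps them aligned):

import numpy as np

idx = np.random.permutation(len(filenames))       # shuffled "pointers"
shuffled_filenames = [filenames[i] for i in idx]
shuffled_labels = [labels[i] for i in idx]        # same permutation keeps pairs aligned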
NumPy might still be slightly more efficient, though, due to the overhead of shuffling inside the computational graph (which tf.data.Dataset.shuffle does; there is actually a dedicated C++ kernel for this operation).
The advantage of the tf.data.Dataset
approach is that it can automatically reshuffle the Dataset after each epoch.
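For example (a sketch reusing the question's variables; num_epochs is assumed), shuffle reshuffles on every pass by default through its reshuffle_each_iteration argument:

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames), reshuffle_each_iteration=True)  # True is the default
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)

for epoch in range(num_epochs):
    for batch in dataset:  # a different shuffle order on each epoch
        ...                # train step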