My understanding is that it is good practice to shuffle training samples for each epoch so that each mini-batch contains a nice random sample of the entire dataset. If I convert my entire dataset into a single file containing TFRecords, then how is this shuffling to be achieved short of loading the entire dataset? My understanding is that there is no efficient random access to TFRecord files. So, to be specific, I am looking for guidance on how TFRecord files are used in this scenario.
It's not achieved directly. You can improve the mixing somewhat by sharding your input into multiple data files, and then treating them as explained in this answer.
If you need anything close to "perfect" shuffling, you would need to read the whole dataset into memory, but in practice, for most purposes, you'll probably get "good enough" shuffling by splitting the data into 100 or 1000 files and then using a shuffle queue that's big enough to hold 8-16 files' worth of data.
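To make this concrete, here is a rough sketch of the shard-and-shuffle idea using the tf.data API (the modern equivalent of a shuffle queue). The file pattern, cycle length, and buffer size are illustrative assumptions, not values prescribed above:

```python
import tensorflow as tf

# Hypothetical shard pattern; assumes the dataset was written as ~100 files.
files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=True)

# Read several shards at once so records from different files interleave.
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=8,  # number of files read concurrently
    num_parallel_calls=tf.data.AUTOTUNE)

# Shuffle buffer sized to hold roughly 8-16 files' worth of records;
# 10_000 here is an assumed records-per-file figure.
dataset = dataset.shuffle(buffer_size=10_000)
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)
```

The key point is that randomness comes from two places: the order in which files are opened, and the in-memory shuffle buffer applied to the interleaved stream of records.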
I have an itch in the back of my head to write an external random shuffle queue that can spill to disk, but it's very low on my priority list -- if someone wanted to contribute one, I'm volunteering to review it. :)
Actually, now you don't have to worry about shuffling before saving to TFRecords, because the (currently) recommended way to read TFRecords is tf.data.TFRecordDataset, which provides a .shuffle() method.
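For example, a minimal sketch (the filename and buffer size are placeholders):

```python
import tensorflow as tf

# Assumed filename; a list of TFRecord files also works here.
dataset = tf.data.TFRecordDataset("train.tfrecord")

# shuffle() keeps a buffer of this many records and samples uniformly
# from it, so the shuffle is only approximate when the buffer is smaller
# than the dataset.  reshuffle_each_iteration re-randomises every epoch.
dataset = dataset.shuffle(buffer_size=10_000,
                          reshuffle_each_iteration=True)
dataset = dataset.repeat()  # iterate for multiple epochs
dataset = dataset.batch(32)
```

Note that this is still buffer-based shuffling, so for a single large file the quality of the mixing depends on how large a buffer you can afford; the sharding approach from the earlier answer remains useful when the data was written in a highly ordered way.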