TFRecords and record shuffling

Question

My understanding is that it is good practice to shuffle training samples for each epoch so that each mini-batch contains a good random sample of the entire dataset. If I convert my entire dataset into a single file containing TFRecords, how is this shuffling to be achieved short of loading the entire dataset? My understanding is that there is no efficient random access to TFRecord files. So, to be specific, I am looking for guidance as to how TFRecord files are used in this scenario.

dga · Accepted Answer

It isn't, at least not perfectly. You can improve the mixing somewhat by sharding your input into multiple input data files, and then treating them as explained in this answer.

If you need anything close to "perfect" shuffling, you would need to read the whole dataset into memory. In practice, for most purposes you will probably get "good enough" shuffling by splitting the data into 100 or 1000 files and then using a shuffle queue that's big enough to hold 8-16 files' worth of data.
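To make the sharding idea concrete, here is a minimal pure-Python sketch of it. It is not TensorFlow's shuffle queue, just a stand-in model of the same technique: visit the shards in random order and pass the stream through a bounded buffer that emits a random element whenever it is full. The names `read_shards` and `shuffle_buffer`, and the shard and buffer sizes in the demo, are made up for illustration.

```python
import random


def read_shards(shards, rng):
    """Stream records shard by shard, visiting the shards in random order."""
    order = list(shards)
    rng.shuffle(order)
    for shard in order:
        yield from shard


def shuffle_buffer(records, buffer_size, rng):
    """Yield records in approximately random order using a bounded buffer."""
    buf = []
    for rec in records:
        if len(buf) < buffer_size:
            buf.append(rec)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new one in its slot.
        i = rng.randrange(buffer_size)
        yield buf[i]
        buf[i] = rec
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buf)
    yield from buf


if __name__ == "__main__":
    rng = random.Random()
    # Hypothetical layout: 100 shards of 10 records, each record = (shard, index).
    shards = [[(s, i) for i in range(10)] for s in range(100)]
    # Buffer sized to hold about 8 shards' worth of records.
    mixed = list(shuffle_buffer(read_shards(shards, rng), 80, rng))
    print(mixed[:10])
```

The mixing is only local: a record can move earlier in the stream by at most roughly the buffer size, which is why randomizing the shard order and using many small shards matters.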

I have an itch in the back of my head to write an external random shuffle queue that can spill to disk, but it's very low on my priority list -- if someone wanted to contribute one, I'm volunteering to review it. :)

bartgras · Answer

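Actually, you no longer have to worry about shuffling before saving to TFRecords. The currently recommended way to read TFRecords is tf.data.TFRecordDataset, which provides a .shuffle() method. Note that .shuffle() draws from a fixed-size buffer, so the result is only a full uniform shuffle when buffer_size is at least the number of records.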
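As a rough sketch of how this can look with tf.data (assuming TensorFlow 2.4 or later, and assuming the data is already split into several TFRecord files), the file order is shuffled, records from a few files are interleaved, and the result goes through a shuffle buffer. The function name `make_dataset`, the file pattern, and the default sizes are hypothetical and should be tuned to the dataset.

```python
import tensorflow as tf


def make_dataset(file_pattern, shuffle_buffer=10_000, batch_size=32):
    """Build a shuffled, batched dataset of raw serialized records."""
    # Shuffle the order of the shard files (reshuffled on every iteration).
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    # Read several shards at once so neighbouring records come from different files.
    ds = files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    # Buffer-based shuffle: uniform only if shuffle_buffer >= number of records.
    ds = ds.shuffle(shuffle_buffer, reshuffle_each_iteration=True)
    return ds.batch(batch_size)
```

Each element is still a serialized record, so a parsing step (for example with tf.io.parse_single_example) would normally be mapped over the dataset before batching.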

Tags:

tensorflow

bobw
