
TFRecords and record shuffling

Tags: tensorflow

My understanding is that it is good practice to shuffle training samples each epoch so that every mini-batch is a random sample of the entire dataset. If I convert my entire dataset into a single file of TFRecords, how can this shuffling be achieved short of loading the whole dataset into memory? As far as I know, TFRecord files do not support efficient random access. Specifically, I am looking for guidance on how TFRecord files are meant to be used in this scenario.

bobw asked Feb 26 '16 16:02


2 Answers

It can't be, at least not perfectly. You can improve the mixing somewhat by sharding your input into multiple input data files, and then treating them as explained in this answer.

If you need anything close to "perfect" shuffling, you would need to read the whole dataset into memory. In practice, though, you'll probably get "good enough" shuffling for most purposes by splitting the data into 100 or 1000 files and then using a shuffle queue that's big enough to hold 8-16 files' worth of data.
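To make the idea concrete, here is a minimal pure-Python sketch of sharding plus a bounded shuffle buffer. It is an illustration only, not TensorFlow's queue implementation: the names `shard` and `epoch` are hypothetical, in-memory lists stand in for the shard files, and scattering records randomly across shards at write time is an assumption of this sketch.

```python
import random


def shard(records, num_shards, seed=0):
    """Scatter records across shards (lists standing in for files).

    Random assignment at write time is an assumption of this sketch.
    """
    rng = random.Random(seed)
    shards = [[] for _ in range(num_shards)]
    for rec in records:
        shards[rng.randrange(num_shards)].append(rec)
    return shards


def epoch(shards, buffer_size, rng):
    """Yield every record exactly once in approximately shuffled order.

    At most `buffer_size` records are held in memory at any time.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")

    # Visit the shards in a fresh random order each epoch.
    order = list(range(len(shards)))
    rng.shuffle(order)

    buffer = []
    for idx in order:
        for rec in shards[idx]:  # each shard is read sequentially
            if len(buffer) < buffer_size:
                buffer.append(rec)  # still filling the buffer
                continue
            # Buffer is full: emit a random element, put the new one in its slot.
            j = rng.randrange(buffer_size)
            yield buffer[j]
            buffer[j] = rec

    # Drain whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


# Example: 100 shards of roughly 100 records each, buffer of about 10 shards' worth.
shards = shard(range(10_000), num_shards=100)
rng = random.Random(42)
first_epoch = list(epoch(shards, buffer_size=1_000, rng=rng))
second_epoch = list(epoch(shards, buffer_size=1_000, rng=rng))
```

A record can only move as far as the buffer lets it, so the result is not a uniform shuffle. The larger the buffer relative to a shard, and the better mixed the shards are to begin with, the closer it gets.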

I have an itch in the back of my head to write an external random shuffle queue that can spill to disk, but it's very low on my priority list -- if someone wanted to contribute one, I'm volunteering to review it. :)

dga answered Sep 22 '22 02:09


Actually, you no longer have to worry about shuffling before saving to TFRecords. The currently recommended way to read TFRecords is tf.data.TFRecordDataset, which provides a .shuffle() method.
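As a rough sketch of what such a pipeline might look like (assuming TensorFlow 2.4 or later): `make_dataset`, the file pattern, and all the sizes below are hypothetical placeholders, not values from the answer. Keep in mind that `.shuffle()` draws from a fixed-size buffer, so when the dataset is much larger than `buffer_size`, reading from several shards in shuffled order still helps the mixing.

```python
import tensorflow as tf


def make_dataset(file_pattern, shuffle_buffer=10_000, batch_size=32, cycle_length=8):
    """Build a shuffled, batched dataset of raw serialized records.

    All default values are placeholders chosen for illustration.
    """
    # List the shard files; shuffle=True reorders them on every pass.
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)

    # Read several shards at once so neighbouring records come from different files.
    records = files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length,
        num_parallel_calls=tf.data.AUTOTUNE,
    )

    # Buffer-based shuffle: only `shuffle_buffer` records are held in memory.
    records = records.shuffle(shuffle_buffer, reshuffle_each_iteration=True)

    return records.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

A call such as `make_dataset("train-*.tfrecord")` would yield batches of serialized records; parsing them (for example with `tf.io.parse_single_example`) would be added via `.map()` before batching.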

bartgras answered Sep 22 '22 02:09