 

Are there any guidelines on sharding a data set?

Tags:

tensorflow

Are there any guidelines on choosing the number of shard files for a data set, or the number of records in each shard?

In the examples of using tensorflow.contrib.slim,

  • there are roughly 1024 records in each shard of the ImageNet data set (tensorflow/models/inception);

  • there are roughly 600 records in each shard of the flowers data set (tensorflow/models/slim).

Do the number of shard files and the number of records in each shard have any impact on training or on the performance of the trained model?

To my knowledge, if we don't split the data set into multiple shards, shuffling won't be very random, since the capacity of the RandomShuffleQueue may be smaller than the size of the data set.

Are there any other advantages of using multiple shards?
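One concrete advantage of multiple shards is that reads can be interleaved across files, so even a modest shuffle buffer mixes examples from different parts of the data set. A minimal sketch using the `tf.data` API, where each toy "shard" simply repeats its own id in place of a real TFRecord file (names and sizes here are illustrative, not from the question):

```python
import tensorflow as tf

# Three toy "shards"; shard i yields its own id four times.
shards = tf.data.Dataset.range(3)
dataset = shards.interleave(
    lambda i: tf.data.Dataset.from_tensors(i).repeat(4),
    cycle_length=3,   # read from all three shards concurrently
    block_length=1)   # take one element from each shard per cycle

mixed = list(dataset.as_numpy_iterator())
# Round-robin across shards: [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]

# A follow-up shuffle buffer much smaller than the data set now still
# mixes examples that came from different shards:
dataset = dataset.shuffle(buffer_size=4)
```

With a single unsharded file, by contrast, the shuffle buffer can only mix examples that happen to sit close together in that one file.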


Update

The documentation says

If you have more reading threads than input files, you risk having two threads read the same example from the same file near each other.

Why can't we use 50 threads to read from 5 files?

Asked by Jenny, Dec 01 '25 02:12

1 Answer

Newer versions of TensorFlow (2.5 at the time of writing) provide a shard feature on tf.data.Dataset. Below is sample code from the TensorFlow documentation:

import tensorflow as tf

A = tf.data.Dataset.range(10)
B = A.shard(num_shards=3, index=0)   # keep every 3rd element, starting at index 0
list(B.as_numpy_iterator())          # [0, 3, 6, 9]

When reading a single input file, you can shard elements as follows:

d = tf.data.TFRecordDataset(input_file)
d = d.shard(num_workers, worker_index)   # each worker keeps every num_workers-th record
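When the data set is already split into several files, the usual pattern is instead to shard by filename before any records are read, so each worker touches a disjoint subset of files. A sketch with hypothetical filenames and worker values (in practice you would discover files with tf.data.Dataset.list_files and take num_workers/worker_index from your cluster configuration):

```python
import tensorflow as tf

# Hypothetical shard filenames -- substitute your own TFRecord shards.
filenames = ["train-0.tfrecord", "train-1.tfrecord",
             "train-2.tfrecord", "train-3.tfrecord"]
num_workers, worker_index = 2, 0   # assumed values for illustration

files = tf.data.Dataset.from_tensor_slices(filenames)
files = files.shard(num_workers, worker_index)   # shard by file, before reading
picked = [f.decode() for f in files.as_numpy_iterator()]
# Worker 0 keeps every 2nd file: ['train-0.tfrecord', 'train-2.tfrecord']

# Read only this worker's files (lazy: nothing is opened until iteration).
d = files.interleave(tf.data.TFRecordDataset, cycle_length=2)
```

Sharding by file avoids every worker scanning the full record stream only to discard most of it, which is what per-record sharding of a single file implies.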
