 

Are there any guidelines on sharding a data set?

Tags:

tensorflow

Are there any guidelines on choosing the number of shard files for a data set, or the number of records in each shard?

In the examples of using tensorflow.contrib.slim,

  • there are roughly 1024 records in each shard of the ImageNet data set (tensorflow/models/inception);

  • there are roughly 600 records in each shard of the flowers data set (tensorflow/models/slim).

Do the number of shard files and the number of records in each shard have any impact on training or on the performance of the trained model?

To my knowledge, if we don't split the data set into multiple shards, shuffling won't be very random, since the capacity of the RandomShuffleQueue may be smaller than the size of the data set.

Are there any other advantages of using multiple shards?
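One concrete advantage of multiple shards is that reads can be interleaved across files, so even a modest shuffle buffer mixes examples from different parts of the data set. A minimal sketch using the `tf.data` API, where each toy "shard" simply repeats its own id in place of a real TFRecord file (names and sizes here are illustrative, not from the question):

```python
import tensorflow as tf

# Three toy "shards"; shard i yields its own id four times.
shards = tf.data.Dataset.range(3)
dataset = shards.interleave(
    lambda i: tf.data.Dataset.from_tensors(i).repeat(4),
    cycle_length=3,   # read from all three shards concurrently
    block_length=1)   # take one element from each shard per cycle

mixed = list(dataset.as_numpy_iterator())
# Round-robin across shards: [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]

# A follow-up shuffle buffer much smaller than the data set now still
# mixes examples that came from different shards:
dataset = dataset.shuffle(buffer_size=4)
```

With a single unsharded file, by contrast, the shuffle buffer can only mix examples that happen to sit close together in that one file.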


Update

The documentation says

If you have more reading threads than input files, you risk having two threads read the same example from the same file near each other.

Why can't we use 50 threads to read from 5 files?

Asked by Jenny, Dec 01 '25 02:12

1 Answer

Newer versions of TensorFlow (2.5 at the time of writing) provide a shard feature on tf.data.Dataset. Below is sample code from the TensorFlow documentation:

import tensorflow as tf

A = tf.data.Dataset.range(10)
B = A.shard(num_shards=3, index=0)   # keep every 3rd element, starting at index 0
list(B.as_numpy_iterator())          # [0, 3, 6, 9]

When reading a single input file, you can shard elements as follows:

d = tf.data.TFRecordDataset(input_file)
d = d.shard(num_workers, worker_index)   # each worker keeps every num_workers-th record
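When the data set is already split into several files, the usual pattern is instead to shard by filename before any records are read, so each worker touches a disjoint subset of files. A sketch with hypothetical filenames and worker values (in practice you would discover files with tf.data.Dataset.list_files and take num_workers/worker_index from your cluster configuration):

```python
import tensorflow as tf

# Hypothetical shard filenames -- substitute your own TFRecord shards.
filenames = ["train-0.tfrecord", "train-1.tfrecord",
             "train-2.tfrecord", "train-3.tfrecord"]
num_workers, worker_index = 2, 0   # assumed values for illustration

files = tf.data.Dataset.from_tensor_slices(filenames)
files = files.shard(num_workers, worker_index)   # shard by file, before reading
picked = [f.decode() for f in files.as_numpy_iterator()]
# Worker 0 keeps every 2nd file: ['train-0.tfrecord', 'train-2.tfrecord']

# Read only this worker's files (lazy: nothing is opened until iteration).
d = files.interleave(tf.data.TFRecordDataset, cycle_length=2)
```

Sharding by file avoids every worker scanning the full record stream only to discard most of it, which is what per-record sharding of a single file implies.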
