 

Does making multiple shards of your data with multiple threads minimize the training time?

Tags:

tensorflow

My main issue is: I have a 204 GB training TFRecord file containing 2 million images, and a 28 GB validation TFRecord file with 302,900 images. It takes 8 hours to train one epoch, so training would take 33 days. I want to speed this up by using multiple threads and shards, but I am a little confused about a couple of things.

The tf.data.Dataset API has a shard function, and the documentation says the following about it:

Creates a Dataset that includes only 1/num_shards of this dataset.

This dataset operator is very useful when running distributed training, as it allows each worker to read a unique subset.

When reading a single input file, you can skip elements as follows:

d = tf.data.TFRecordDataset(FLAGS.input_file)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)

Important caveats:

Be sure to shard before you use any randomizing operator (such as shuffle). Generally it is best if the shard operator is used early in the dataset pipeline. For example, when reading from a set of TFRecord files, shard before converting the dataset to input samples. This avoids reading every file on every worker. The following is an example of an efficient sharding strategy within a complete pipeline:

d = Dataset.list_files(FLAGS.pattern)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.interleave(tf.data.TFRecordDataset,
                 cycle_length=FLAGS.num_readers, block_length=1)

d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)

These are my questions:

1- Is there any relation between the number of TFRecord files and the number of shards? Does the number of shards (workers) depend on the number of CPUs you have, or on the number of TFRecord files you have? And how do I create the shards: just by setting the number of shards to a specific number, or do I need to split the data into multiple files first and then set a specific number of shards? Note that "number of workers" refers to the number of shards.

2- What is the benefit of creating multiple TFRecord files? Some people said here that it is related to shuffling your TFRecords in a better way, but with the shuffle method that exists in the tf.data.Dataset API we don't need to do that, and other people said here that it is only to split your data into smaller parts. My question: do I need, as a first step, to split my TFRecord file into multiple files?

3- Now we come to num_threads in the map function (num_parallel_calls in newer versions of TensorFlow). Should it be the same as the number of shards you have? When I searched, I found some people saying that if you have 10 shards and 2 threads, each thread will take 5 shards.

4- What about the d.interleave function? I know how it works, as shown in this example, but again I am missing the connection to num_threads and cycle_length, for example.

5- If I want to use multiple GPUs, should I use shards, as mentioned in the accepted comment here?

In summary, I am confused about the relation between the number of TFRecord files, num_shards (workers), cycle_length, and num_threads (num_parallel_calls), and about the best setup to minimize training time in both cases (using multiple GPUs and using a single GPU).

asked Nov 30 '17 by W. Sam


1 Answer

I am a tf.data developer. Let's see if I can help answer your questions.

1) It sounds like you have one large file. To process it using multiple workers, it makes sense to split it into multiple smaller files so that the tf.data input pipeline can process different files on different workers (by virtue of applying the shard function over the list of filenames).
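
For illustration, here is a minimal sketch of one way to do that split, assuming eager execution (TF 2.x); the file names and shard count are placeholders, not from the original question:

import tensorflow as tf

NUM_SHARDS = 16  # placeholder; pick based on how many workers/readers you plan to use
raw = tf.data.TFRecordDataset("train.tfrecord")  # the single large file

writers = [tf.io.TFRecordWriter("train-%05d-of-%05d.tfrecord" % (i, NUM_SHARDS))
           for i in range(NUM_SHARDS)]
for i, record in enumerate(raw):
    # round-robin the serialized examples across the shard files
    writers[i % NUM_SHARDS].write(record.numpy())
for w in writers:
    w.close()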

2) If you do not split your single file into multiple files, each worker will have to read the entire file, consuming n times more I/O bandwidth (where n is the number of workers).

3) The number of threads for the map transformation is independent of the number of shards. Each shard will be processed with num_parallel_calls threads on its worker. In general, it is reasonable to set num_parallel_calls proportional to the number of available cores on the worker.
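
For example (a sketch, not from the original answer; parser_fn and the file name are placeholders):

import multiprocessing
import tensorflow as tf

num_cores = multiprocessing.cpu_count()
d = tf.data.TFRecordDataset("train-00000-of-00016.tfrecord")  # placeholder shard file
d = d.map(parser_fn, num_parallel_calls=num_cores)
# Newer TensorFlow versions can also tune this automatically:
# d = d.map(parser_fn, num_parallel_calls=tf.data.AUTOTUNE)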

4) The purpose of the interleave transformation is to combine multiple datasets (e.g. read from different TFRecord files) into a single dataset. Given your use case, I don't think you need to use it.
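
For completeness, a small sketch of what interleave does (the file pattern and cycle_length are placeholders):

import tensorflow as tf

files = tf.data.Dataset.list_files("train-*.tfrecord")
d = files.interleave(tf.data.TFRecordDataset,
                     cycle_length=4,   # read from 4 files concurrently
                     block_length=1)   # take 1 record from each file per cycle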

5) If you want to feed multiple GPUs, I would recommend going with the first option listed in the comment you referenced, as it is the easiest. The shard-based solution requires creating multiple pipelines on each worker (one for each GPU).
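
As a rough sketch of the simpler route using the newer tf.distribute API (which postdates this answer and is not what the referenced comment described; build_model and make_input_pipeline are hypothetical helpers):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
with strategy.scope():
    model = build_model()  # hypothetical: returns a Keras model
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

dataset = make_input_pipeline()  # hypothetical: returns a batched tf.data.Dataset
model.fit(dataset, epochs=10)    # Keras splits each batch across the replicas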

To minimize training time (for a single GPU or multiple GPUs), you want to make sure that your input pipeline produces data faster than the GPU(s) can process it. Performance tuning of an input pipeline is discussed here.
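
A sketch of the usual knobs (AUTOTUNE and interleave's num_parallel_calls argument exist only in newer TensorFlow versions; parser_fn, the file pattern, and the buffer/batch sizes are placeholders):

import tensorflow as tf

d = tf.data.Dataset.list_files("train-*.tfrecord")
d = d.interleave(tf.data.TFRecordDataset,
                 num_parallel_calls=tf.data.AUTOTUNE)      # parallel file reads
d = d.shuffle(10000)
d = d.map(parser_fn, num_parallel_calls=tf.data.AUTOTUNE)  # parallel decoding
d = d.batch(32)
d = d.prefetch(tf.data.AUTOTUNE)  # prepare the next batches while the GPU trains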

answered by jsimsa