Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tensorflow Data API - prefetch

I am trying to use new features of TF, namely Data API, and I am not sure how prefetch works. In the code below

def dataset_input_fn(...)
    dataset = tf.data.TFRecordDataset(filenames, compression_type="ZLIB")
    dataset = dataset.map(lambda x:parser(...))
    dataset = dataset.map(lambda x,y: image_augmentation(...)
                      , num_parallel_calls=num_threads
                     )

    dataset = dataset.shuffle(buffer_size)
    dataset = dataset.batch(batch_size)    
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()

does it matter between each lines above I put dataset=dataset.prefetch(batch_size)? Or maybe it should be after every operation that would be using output_buffer_size if the dataset was coming from tf.contrib.data?

like image 722
MPękalski Avatar asked Nov 01 '17 22:11

MPękalski


People also ask

What is prefetch in TensorFlow dataset?

Prefetching. Prefetching overlaps the preprocessing and model execution of a training step. While the model is executing training step s , the input pipeline is reading the data for step s+1 . Doing so reduces the step time to the maximum (as opposed to the sum) of the training and the time it takes to extract the data ...

What does prefetch () do python?

prefetch() function is used to produce a dataset that prefetches the specified elements from this given dataset. Parameters: This function accepts a parameter which is illustrated below: bufferSize: It is an integer value that specifies the number of elements to be prefetched.

What does From_tensor_slices do?

from_tensor_slices() It removes the first dimension and use it as a dataset dimension.

Does TF data use GPU?

If a TensorFlow operation has both CPU and GPU implementations, by default, the GPU device is prioritized when the operation is assigned. For example, tf. matmul has both CPU and GPU kernels and on a system with devices CPU:0 and GPU:0 , the GPU:0 device is selected to run tf.


1 Answers

In discussion on github I found a comment by mrry:

Note that in TF 1.4 there will be a Dataset.prefetch() method that makes it easier to add prefetching at any point in the pipeline, not just after a map(). (You can try it by downloading the current nightly build.)

and

For example, Dataset.prefetch() will start a background thread to populate a ordered buffer that acts like a tf.FIFOQueue, so that downstream pipeline stages need not block. However, the prefetch() implementation is much simpler, because it doesn't need to support as many different concurrent operations as a tf.FIFOQueue.

so it means prefetch could be put by any command and it works on the previous command. So far I have noticed the biggest performance gains by putting it only at the very end.

There is one more discussion on Meaning of buffer_size in Dataset.map , Dataset.prefetch and Dataset.shuffle where mrry explains a bit more about the prefetch and buffer.

UPDATE 2018/10/01:

From version 1.7.0 Dataset API (in contrib) has an option to prefetch_to_device. Note that this transformation has to be the last in the pipeline and when TF 2.0 arrives contrib will be gone. To have prefetch work on multiple GPUs please use MultiDeviceIterator (example see #13610) multi_device_iterator_ops.py.

https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/data/prefetch_to_device

like image 90
MPękalski Avatar answered Oct 07 '22 03:10

MPękalski