
Proper Usage of PyTorch's non_blocking=True for Data Prefetching

I am looking into prefetching data from the CPU to the GPU while the model is training on the GPU. Overlapping CPU-to-GPU data transfer with GPU model training appears to require both:

  1. Transferring the data to the GPU using data = data.cuda(non_blocking=True)
  2. Pinning the data in page-locked CPU memory using train_loader = DataLoader(..., pin_memory=True)
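
For example, a minimal sketch combining the two (train_dataset is a placeholder for any dataset):

import torch
from torch.utils.data import DataLoader

# pin_memory=True makes the loader return batches in page-locked host memory
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                          num_workers=4, pin_memory=True)

for images, target in train_loader:
    # non_blocking=True lets the host continue while the copy is in flight
    images = images.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)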

However, I cannot understand how the non-blocking transfer is actually performed in this official PyTorch example, specifically in this code block:

for i, (images, target) in enumerate(train_loader):
    # measure data loading time
    data_time.update(time.time() - end)

    if args.gpu is not None:
        images = images.cuda(args.gpu, non_blocking=True)
    if torch.cuda.is_available():
        target = target.cuda(args.gpu, non_blocking=True)

    # compute output
    output = model(images)
    loss = criterion(output, target)

Won't images.cuda(non_blocking=True) and target.cuda(non_blocking=True) have to be completed before output = model(images) is executed? Since this is a synchronization point, images must first be fully transferred to the CUDA device, so the data transfer steps are effectively no longer non-blocking.

And since output = model(images) is blocking, the images.cuda() and target.cuda() calls in the next iteration of the for loop will not start until the model output has been computed, meaning there is no prefetching across loop iterations.

If this is correct, what is the correct way to perform data prefetching to the GPU?

Athena Wisdom asked Aug 18 '20 01:08

People also ask

What does non_blocking mean in PyTorch?

non_blocking=True indicates that the tensor will be copied to the GPU asynchronously, so if you try to use the result immediately after executing the statement, the copy may not have completed yet.
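
A small illustration of that caveat (assuming a CUDA device is available):

import torch

x = torch.randn(1024, 1024).pin_memory()  # async copies require pinned source memory
y = x.cuda(non_blocking=True)             # returns immediately; the copy may still be in flight

# Work queued on the same CUDA stream is automatically ordered after the copy,
# but the host must synchronize explicitly before e.g. timing the transfer:
torch.cuda.synchronize()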

What is prefetching in machine learning?

Prefetching in storage systems is the process of preloading data from a slow storage device into faster memory, generally DRAM, to decrease the overall read latency. Accurate and timely prefetching can effectively reduce the performance gap between different levels of memory.

What is CUDA non_blocking?

Passing non_blocking=True to a cuda() or to() call requests an asynchronous copy: the call returns to the host immediately instead of waiting for the transfer to finish. The asynchronous path is only taken when the source tensor lives in pinned memory; otherwise the copy is effectively synchronous.

What is pinned memory in PyTorch?

Pinned (page-locked) memory is used to speed up a CPU-to-GPU memory copy (as executed by e.g. tensor.cuda() in PyTorch) by guaranteeing that the source pages stay resident in RAM and cannot be paged out. A pageable source must first be staged into a pinned buffer before the GPU's DMA engine can read it, i.e. it has to be copied twice.
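
In PyTorch, pinning can be requested per tensor as well as through the DataLoader, e.g. (assuming a CUDA-enabled build):

import torch

t = torch.randn(1024, 1024)
t = t.pin_memory()      # copy the tensor into page-locked host memory
print(t.is_pinned())    # True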

How to implement a data prefetcher in PyTorch?

The first approach to implementing a data prefetcher is to use the non_blocking=True option, just as NVIDIA did in their working data prefetcher in the Apex project. For this approach to work, however, the CPU tensors must be pinned (i.e. the PyTorch DataLoader should be created with pin_memory=True), as shown in the sketch below.
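
A condensed sketch in the spirit of the Apex prefetcher (simplified, not NVIDIA's exact code; loader is assumed to yield (images, target) batches from a pin_memory=True DataLoader):

import torch

class DataPrefetcher:
    """Overlaps host-to-device copies with GPU compute using a side stream."""

    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()  # dedicated copy stream
        self.preload()

    def preload(self):
        try:
            self.next_images, self.next_target = next(self.loader)
        except StopIteration:
            self.next_images = None
            self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # issue the copies on the side stream; requires pinned CPU tensors
            self.next_images = self.next_images.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        # make the compute (default) stream wait until the copies have finished
        torch.cuda.current_stream().wait_stream(self.stream)
        images, target = self.next_images, self.next_target
        if images is not None:
            self.preload()  # immediately start copying the following batch
        return images, target

The training loop then pulls batches from the prefetcher instead of iterating the DataLoader directly:

prefetcher = DataPrefetcher(train_loader)
images, target = prefetcher.next()
while images is not None:
    output = model(images)
    loss = criterion(output, target)
    # ... backward, optimizer step ...
    images, target = prefetcher.next()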


What is the PyTorch DataLoader class?

PyTorch includes packages to prepare and load common datasets for your model. At the heart of the PyTorch data loading utility is the torch.utils.data.DataLoader class, which represents a Python iterable over a dataset.
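
For instance, a minimal, self-contained iteration over a DataLoader:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
loader = DataLoader(dataset, batch_size=10, shuffle=True)

for features, labels in loader:           # DataLoader is a Python iterable
    print(features.shape, labels.shape)   # torch.Size([10, 3]) torch.Size([10])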



1 Answer

I think where you are off is in assuming that output = model(images) is a synchronization point. It is not: CUDA kernel launches are asynchronous with respect to the host, and host-to-device copies are handled by the GPU's dedicated copy engines, which run concurrently with its compute units. Quote from the official PyTorch docs:

Also, once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a to() or a cuda() call. This can be used to overlap data transfers with computation.
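
This is easy to verify: a kernel launch returns to the host almost immediately, and the wall-clock cost of the computation only shows up once you force a synchronization. A small sketch (model and images stand for any CUDA module and matching input):

import time
import torch

t0 = time.time()
output = model(images)    # queues the kernels and returns almost immediately
t1 = time.time()
torch.cuda.synchronize()  # block until the queued kernels actually finish
t2 = time.time()

print(f"launch: {t1 - t0:.4f}s, compute: {t2 - t1:.4f}s")

While those kernels run, the host is free to fetch the next batch and issue its non_blocking copies, which execute on the copy engine, so transfer and compute overlap.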

S. Iqbal answered Nov 15 '22 10:11