I am looking into prefetching data from the CPU to the GPU while the model is training on the GPU. Overlapping CPU-to-GPU data transfer with GPU model training appears to require both
data = data.cuda(non_blocking=True)
train_loader = DataLoader(..., pin_memory=True)
However, I cannot understand how non-blocking transfer is being performed in this official PyTorch example, specifically this code block:
for i, (images, target) in enumerate(train_loader):
    # measure data loading time
    data_time.update(time.time() - end)

    if args.gpu is not None:
        images = images.cuda(args.gpu, non_blocking=True)
    if torch.cuda.is_available():
        target = target.cuda(args.gpu, non_blocking=True)

    # compute output
    output = model(images)
    loss = criterion(output, target)
Won't images.cuda(non_blocking=True) and target.cuda(non_blocking=True) have to complete before output = model(images) is executed? Since this is a synchronization point, images must first be fully transferred to the CUDA device, so the data transfer steps are effectively no longer non-blocking. And since output = model(images) is blocking, images.cuda() and target.cuda() in the next iteration of the for loop will not start until the model output has been computed, meaning there is no prefetching in the next loop iteration.
If this is correct, what is the correct way to perform data prefetching to the GPU?
aakashns (Aakash N S) February 22, 2021, 6:16am #3
non_blocking=True means the copy to the GPU is issued asynchronously. So, if you try to access the data immediately after executing the statement, the transfer may not have finished yet.
Prefetching in storage systems is the process of preloading data from a slow storage device into faster memory, generally DRAM, to decrease the overall read latency. Accurate and timely prefetching can effectively reduce the performance gap between different levels of memory [30].
Pinned (page-locked) memory is used to speed up a CPU-to-GPU memory copy operation (as executed by e.g. tensor.cuda() in PyTorch) by guaranteeing that the source pages cannot be swapped out, so the GPU can read them directly via DMA. A copy from ordinary pageable memory is first staged through an internal pinned buffer, i.e. the data has to be copied twice.
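Putting those two points together, here is a minimal sketch (assuming a CUDA device is available; the tensor shape and the timing prints are only for illustration) of a pinned, non-blocking host-to-device copy, and of where the host has to synchronize explicitly if it wants to time the transfer:

import time
import torch

# Source tensor in pinned (page-locked) host memory.
x_cpu = torch.randn(64, 3, 224, 224).pin_memory()

start = time.time()
# With a pinned source this copy is only enqueued; the call returns to
# Python almost immediately instead of waiting for the transfer.
x_gpu = x_cpu.cuda(non_blocking=True)
print(f"host returned after {time.time() - start:.6f} s")

# Kernels launched later on the same stream (e.g. model(x_gpu)) are ordered
# after the copy, so they see the complete data without extra syncing.
# Only the host must wait explicitly, e.g. to get a meaningful timing:
torch.cuda.synchronize()
print(f"copy finished after {time.time() - start:.6f} s")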
The first approach to implementing a data prefetcher is to use the non_blocking=True option, just as NVIDIA did in their working data prefetcher in the Apex project. However, for this approach to work, the CPU tensor must be pinned (i.e. the PyTorch DataLoader should use the argument pin_memory=True).
Quote from the official PyTorch docs: Also, once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a to() or a cuda() call. This can be used to overlap data transfers with computation.
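For reference, here is a condensed sketch of that first approach, loosely modeled on the Apex prefetcher: the copies for the next batch are issued with non_blocking=True on a side CUDA stream while the current batch is being consumed on the default stream. The class name CUDAPrefetcher and its exact interface are made up for illustration, and it assumes the DataLoader was created with pin_memory=True.

import torch

class CUDAPrefetcher:
    """Wraps a DataLoader and keeps the next batch's host-to-device copy
    in flight on a side stream while the current batch is being processed."""

    def __init__(self, loader, device="cuda"):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            self.next_images, self.next_target = next(self.loader)
        except StopIteration:
            self.next_images = self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # Asynchronous copies from pinned host memory; they run on the
            # side stream and can overlap with kernels on the default stream.
            self.next_images = self.next_images.to(self.device, non_blocking=True)
            self.next_target = self.next_target.to(self.device, non_blocking=True)

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_images is None:
            raise StopIteration
        # Make the default stream wait for the in-flight copies before any
        # kernel reads the prefetched tensors.
        torch.cuda.current_stream().wait_stream(self.stream)
        images, target = self.next_images, self.next_target
        # Tell the caching allocator these tensors are now used on the
        # default stream, so their memory is not reused too early.
        images.record_stream(torch.cuda.current_stream())
        target.record_stream(torch.cuda.current_stream())
        # Immediately start copying the following batch.
        self._preload()
        return images, target

In the training loop you would then iterate over CUDAPrefetcher(train_loader) instead of train_loader directly.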
PyTorch includes packages to prepare and load common datasets for your model. At the heart of PyTorch data loading utility is the torch.utils.data.DataLoader class. It represents a Python iterable over a dataset.
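As a concrete illustration, a minimal loader set up for the pattern discussed in this thread; the TensorDataset here is just a placeholder for a real dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 1024 fake images and labels.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                        torch.randint(0, 1000, (1024,)))

# pin_memory=True makes the loader return batches in page-locked host
# memory, which is what allows cuda(non_blocking=True) to be asynchronous.
train_loader = DataLoader(dataset, batch_size=32, num_workers=4,
                          pin_memory=True, shuffle=True)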
I think where you are off is in treating output = model(images) as a synchronization point for the host: kernel launches are asynchronous, so the host only queues the work and moves on. The data copies themselves are handled by a different part of the GPU (its copy engine), separate from the units doing the computation, which is what allows transfers to overlap with compute. Quote from the official PyTorch docs:
Also, once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a to() or a cuda() call. This can be used to overlap data transfers with computation.
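To make that concrete, here is a sketch of where the host actually does and does not wait in a loop like the one in the question (model, criterion, and train_loader are assumed to be defined as there, with pin_memory=True on the loader):

for images, target in train_loader:
    # Both copies are only enqueued; with pinned batches the host returns
    # immediately from these calls.
    images = images.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)

    # Kernel launches are asynchronous as well: the host queues the forward
    # pass on the default stream (ordered after the copies) and moves on,
    # so the DataLoader can already prepare the next batch on the CPU.
    output = model(images)
    loss = criterion(output, target)

    # Only an operation that needs the value on the CPU forces the host to
    # wait for the GPU, for example reading the loss:
    print(loss.item())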