I want to understand how pin_memory in DataLoader works.
According to the documentation:
pin_memory (bool, optional) – If True, the data loader will copy tensors into CUDA pinned memory before returning them.
Below is a self-contained code example.
import torchvision
import torch

print('torch.cuda.is_available()', torch.cuda.is_available())

train_dataset = torchvision.datasets.CIFAR10(root='cifar10_pytorch', download=True,
                                              transform=torchvision.transforms.ToTensor())
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, pin_memory=True)

x, y = next(iter(train_dataloader))
print('x.device', x.device)
print('y.device', y.device)
Producing the following output:
torch.cuda.is_available() True
x.device cpu
y.device cpu
But I was expecting something like this, because I specified pin_memory=True in the DataLoader:

torch.cuda.is_available() True
x.device cuda:0
y.device cuda:0
I also ran a benchmark:
import torchvision
import torch
import time
import numpy as np

pin_memory = True
train_dataset = torchvision.datasets.CIFAR10(root='cifar10_pytorch', download=True,
                                              transform=torchvision.transforms.ToTensor())
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, pin_memory=pin_memory)
print('pin_memory:', pin_memory)

times = []
n_runs = 10

for i in range(n_runs):
    st = time.time()
    for bx, by in train_dataloader:
        bx, by = bx.cuda(), by.cuda()
    times.append(time.time() - st)

print('average time:', np.mean(times))
I got the following results.
pin_memory: False
average time: 6.5701503753662

pin_memory: True
average time: 7.0254474401474
So pin_memory=True only makes things slower. Can someone explain this behaviour to me?
If you load your samples in the Dataset on the CPU and would like to push them to the GPU during training, you can speed up the host-to-device transfer by enabling pin_memory. This lets your DataLoader allocate the samples in page-locked memory, which speeds up the transfer.
Pinned memory is used to speed up a CPU-to-GPU memory copy operation (as executed by e.g. tensor.cuda() in PyTorch) by ensuring that none of the memory to be copied has been paged out to disk. Memory that has been paged out to disk has to be read back into RAM before it can be transferred to the GPU, i.e., it has to be copied twice.
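To see this effect in isolation, here is a minimal sketch (my own addition, not from the original post) that times a host-to-device copy from pageable versus pinned CPU memory; the tensor shape and the helper name time_copy are arbitrary, and a CUDA device is assumed to be available:

import time
import torch

def time_copy(host_tensor, n_iters=100):
    # time n_iters host-to-device copies of the given CPU tensor
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        host_tensor.cuda(non_blocking=True)  # async only helps if the source is pinned
    torch.cuda.synchronize()                 # wait for all queued copies to finish
    return (time.time() - start) / n_iters

pageable = torch.randn(64, 3, 224, 224)  # ordinary (pageable) CPU memory
pinned = pageable.pin_memory()           # copy of the same data in page-locked memory
print('pageable:', time_copy(pageable))
print('pinned:  ', time_copy(pinned))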
A custom collate_fn can be used to customize collation, e.g., padding sequential data to the max length of a batch. collate_fn is called with a list of data samples each time. It is expected to collate the input samples into a batch for yielding from the data loader iterator.
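As an illustration (my own sketch, not from the original post; the dataset class and function names are made up), a padding collate_fn could look like this:

import torch
from torch.nn.utils.rnn import pad_sequence

class VarLengthDataset(torch.utils.data.Dataset):
    # toy dataset with 1-D samples of different lengths
    def __init__(self):
        self.data = [torch.arange(n, dtype=torch.float32) for n in (3, 5, 2, 4)]
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

def pad_collate(batch):
    # batch is a list of samples; pad them to the longest one and stack
    return pad_sequence(batch, batch_first=True, padding_value=0.0)

loader = torch.utils.data.DataLoader(VarLengthDataset(), batch_size=4, collate_fn=pad_collate)
print(next(iter(loader)).shape)  # torch.Size([4, 5])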
Pinned memory consists of virtual memory pages that are specially marked so that they cannot be paged out. They are allocated with special system API function calls. The important point for us is that CPU memory that serves as the source or destination of a DMA transfer must be allocated as pinned memory.
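In PyTorch you can request such a page-locked allocation directly; a minimal sketch (my own addition, assuming a CUDA-enabled build) is:

import torch

pageable = torch.empty(1024, 1024)                 # ordinary pageable host memory
pinned = torch.empty(1024, 1024, pin_memory=True)  # page-locked host memory, usable for DMA
print(pageable.is_pinned(), pinned.is_pinned())    # False True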
The documentation is perhaps overly laconic, given that the terms used are fairly niche. In CUDA terms, pinned memory does not mean GPU memory but non-paged CPU memory. The benefits and rationale are provided here, but the gist of it is that this flag allows the x.cuda() operation (which you still have to execute as usual) to avoid one implicit CPU-to-CPU copy, which makes it a bit more performant. Additionally, with pinned-memory tensors you can use x.cuda(non_blocking=True) to perform the copy asynchronously with respect to the host. This can lead to performance gains in certain scenarios, namely if your code is structured as
1. x = x.cuda(non_blocking=True)
2. perform some CPU operations which do not require x
3. perform GPU operations using x

Since the copy initiated in 1.
is asynchronous, it does not block 2. from proceeding while the copy is underway, and thus the two can happen side by side (which is the gain). Since step 3. requires x to already be copied over to the GPU, it cannot be executed until 1. is complete - therefore only 1. and 2. can overlap, and 3. will definitely take place afterwards. The duration of 2. is therefore the maximum time you can expect to save with non_blocking=True. Without non_blocking=True your CPU would be waiting idle for the transfer to complete before proceeding with 2.
Note: perhaps step 2. could also comprise GPU operations, as long as they do not require x - I am not sure if this is true, so please don't quote me on that.
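A minimal sketch of that 1./2./3. structure (my own illustration; model and cpu_work are placeholder names, not from the original post) would be:

import torch

model = torch.nn.Linear(1024, 10).cuda()

def cpu_work():
    # step 2: CPU-only work that does not need x, e.g. preparing the next batch
    _ = sum(i * i for i in range(100_000))

x = torch.randn(64, 1024, pin_memory=True)  # pinned source makes the copy truly asynchronous

x = x.cuda(non_blocking=True)  # step 1: copy is queued, the call returns immediately
cpu_work()                     # step 2: runs while the copy may still be in flight
y = model(x)                   # step 3: enqueued on the same CUDA stream, so it runs after the copy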
Edit: I believe you're missing the point with your benchmark. There are three issues with it:

1. You're not using non_blocking=True in your .cuda() calls.
2. You're not using multiple worker processes in your DataLoader, which means that most of the work is done synchronously on the main thread anyway, trumping the memory transfer costs.
3. You're not performing any CPU work in your training loop (aside from the .cuda() calls), so there is no work to be overlaid with memory transfers.

A benchmark closer to how pin_memory is meant to be used would be:
import torchvision, torch, time
import numpy as np

pin_memory = True
batch_size = 1024  # bigger memory transfers to make their cost more noticeable
n_workers = 6      # parallel workers to free up the main thread and reduce data decoding overhead

train_dataset = torchvision.datasets.CIFAR10(
    root='cifar10_pytorch',
    download=True,
    transform=torchvision.transforms.ToTensor()
)
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    pin_memory=pin_memory,
    num_workers=n_workers
)
print('pin_memory:', pin_memory)

times = []
n_runs = 10

def work():
    # emulates the CPU work done
    time.sleep(0.1)

for i in range(n_runs):
    st = time.time()
    for bx, by in train_dataloader:
        bx, by = bx.cuda(non_blocking=pin_memory), by.cuda(non_blocking=pin_memory)
        work()
    times.append(time.time() - st)

print('average time:', np.mean(times))
which gives an average of 5.48s for my machine with memory pinning and 5.72s without.