I want to understand how pin_memory in DataLoader works.
According to the documentation:
pin_memory (bool, optional) – If True, the data loader will copy tensors into CUDA pinned memory before returning them.
Below is a self-contained code example.
import torchvision
import torch

print('torch.cuda.is_available()', torch.cuda.is_available())

train_dataset = torchvision.datasets.CIFAR10(root='cifar10_pytorch', download=True,
                                              transform=torchvision.transforms.ToTensor())
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, pin_memory=True)

x, y = next(iter(train_dataloader))
print('x.device', x.device)
print('y.device', y.device)
Producing the following output:
torch.cuda.is_available() True
x.device cpu
y.device cpu
But I was expecting something like this, because I specified pin_memory=True in the DataLoader:

torch.cuda.is_available() True
x.device cuda:0
y.device cuda:0
I also ran a benchmark:
import torchvision
import torch
import time
import numpy as np

pin_memory = True
train_dataset = torchvision.datasets.CIFAR10(root='cifar10_pytorch', download=True,
                                              transform=torchvision.transforms.ToTensor())
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, pin_memory=pin_memory)
print('pin_memory:', pin_memory)

times = []
n_runs = 10

for i in range(n_runs):
    st = time.time()
    for bx, by in train_dataloader:
        bx, by = bx.cuda(), by.cuda()
    times.append(time.time() - st)

print('average time:', np.mean(times))
I got the following results.
pin_memory: False
average time: 6.5701503753662

pin_memory: True
average time: 7.0254474401474
So pin_memory=True only makes things slower. Can someone explain this behaviour to me?
If you load your samples in the Dataset on the CPU and would like to push them to the GPU during training, you can speed up the host-to-device transfer by enabling pin_memory. This lets your DataLoader allocate the samples in page-locked memory, which speeds up the transfer.
Pinned memory is used to speed up a CPU-to-GPU memory copy operation (as executed by e.g. tensor.cuda() in PyTorch) by ensuring that none of the memory to be copied has been paged out to disk. Memory that has been paged out to disk has to be read back into RAM before it can be transferred to the GPU, i.e., it has to be copied twice.
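To see this effect in isolation, here is a minimal sketch (my own addition, not from the original post) that times a host-to-device copy from pageable versus pinned CPU memory; the tensor shape and the helper name time_copy are arbitrary, and a CUDA device is assumed to be available:

import time
import torch

def time_copy(host_tensor, n_iters=100):
    # time n_iters host-to-device copies of the given CPU tensor
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        host_tensor.cuda(non_blocking=True)  # async only helps if the source is pinned
    torch.cuda.synchronize()                 # wait for all queued copies to finish
    return (time.time() - start) / n_iters

pageable = torch.randn(64, 3, 224, 224)  # ordinary (pageable) CPU memory
pinned = pageable.pin_memory()           # copy of the same data in page-locked memory
print('pageable:', time_copy(pageable))
print('pinned:  ', time_copy(pinned))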
A custom collate_fn can be used to customize collation, e.g., padding sequential data to the max length of a batch. collate_fn is called with a list of data samples each time. It is expected to collate the input samples into a batch for yielding from the data loader iterator.
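As an illustration (my own sketch, not from the original post; the dataset class and function names are made up), a padding collate_fn could look like this:

import torch
from torch.nn.utils.rnn import pad_sequence

class VarLengthDataset(torch.utils.data.Dataset):
    # toy dataset with 1-D samples of different lengths
    def __init__(self):
        self.data = [torch.arange(n, dtype=torch.float32) for n in (3, 5, 2, 4)]
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

def pad_collate(batch):
    # batch is a list of samples; pad them to the longest one and stack
    return pad_sequence(batch, batch_first=True, padding_value=0.0)

loader = torch.utils.data.DataLoader(VarLengthDataset(), batch_size=4, collate_fn=pad_collate)
print(next(iter(loader)).shape)  # torch.Size([4, 5])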
Pinned memory consists of virtual memory pages that are specially marked so that they cannot be paged out. They are allocated with special system API function calls. The important point for us is that CPU memory that serves as the source or destination of a DMA transfer must be allocated as pinned memory.
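In PyTorch you can request such a page-locked allocation directly; a minimal sketch (my own addition, assuming a CUDA-enabled build) is:

import torch

pageable = torch.empty(1024, 1024)                 # ordinary pageable host memory
pinned = torch.empty(1024, 1024, pin_memory=True)  # page-locked host memory, usable for DMA
print(pageable.is_pinned(), pinned.is_pinned())    # False True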
The documentation is perhaps overly laconic, given that the terms used are fairly niche. In CUDA terms, pinned memory does not mean GPU memory but non-paged CPU memory. The benefits and rationale are provided here, but the gist of it is that this flag allows the x.cuda() operation (which you still have to execute as usual) to avoid one implicit CPU-to-CPU copy, which makes it a bit more performant. Additionally, with pinned-memory tensors you can use x.cuda(non_blocking=True) to perform the copy asynchronously with respect to the host. This can lead to performance gains in certain scenarios, namely if your code is structured as
1. x = x.cuda(non_blocking=True)
2. perform some CPU operations which do not require x
3. perform GPU operations using x

Since the copy initiated in 1.
is asynchronous, it does not block 2. from proceeding while the copy is underway, and thus the two can happen side by side (which is the gain). Since step 3. requires x to already be copied over to the GPU, it cannot be executed until 1. is complete - therefore only 1. and 2. can overlap, and 3. will definitely take place afterwards. The duration of 2. is therefore the maximum time you can expect to save with non_blocking=True. Without non_blocking=True your CPU would be waiting idle for the transfer to complete before proceeding with 2.
Note: perhaps step 2. could also comprise GPU operations, as long as they do not require x - I am not sure if this is true, so please don't quote me on that.
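A minimal sketch of that 1./2./3. structure (my own illustration; model and cpu_work are placeholder names, not from the original post) would be:

import torch

model = torch.nn.Linear(1024, 10).cuda()

def cpu_work():
    # step 2: CPU-only work that does not need x, e.g. preparing the next batch
    _ = sum(i * i for i in range(100_000))

x = torch.randn(64, 1024, pin_memory=True)  # pinned source makes the copy truly asynchronous

x = x.cuda(non_blocking=True)  # step 1: copy is queued, the call returns immediately
cpu_work()                     # step 2: runs while the copy may still be in flight
y = model(x)                   # step 3: enqueued on the same CUDA stream, so it runs after the copy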
Edit: I believe you're missing the point with your benchmark. There are three issues with it:

1. You're not using non_blocking=True in your .cuda() calls.
2. You're not using multiple worker processes in your DataLoader, which means that most of the work is done synchronously on the main thread anyway, trumping the memory transfer costs.
3. You're not performing any CPU work in your training loop (aside from the .cuda() calls), so there is no work to be overlaid with memory transfers.

A benchmark closer to how pin_memory is meant to be used would be:
import torchvision, torch, time
import numpy as np

pin_memory = True
batch_size = 1024  # bigger memory transfers to make their cost more noticeable
n_workers = 6      # parallel workers to free up the main thread and reduce data decoding overhead

train_dataset = torchvision.datasets.CIFAR10(
    root='cifar10_pytorch',
    download=True,
    transform=torchvision.transforms.ToTensor()
)
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    pin_memory=pin_memory,
    num_workers=n_workers
)
print('pin_memory:', pin_memory)

times = []
n_runs = 10

def work():
    # emulates the CPU work done
    time.sleep(0.1)

for i in range(n_runs):
    st = time.time()
    for bx, by in train_dataloader:
        bx, by = bx.cuda(non_blocking=pin_memory), by.cuda(non_blocking=pin_memory)
        work()
    times.append(time.time() - st)

print('average time:', np.mean(times))
which gives an average of 5.48s for my machine with memory pinning and 5.72s without.