I'm wondering why pinning memory in PyTorch would make things even slower. By reading the code of torch.utils.data.dataloader, I found that the pin_memory=True option of DataLoader simply calls .pin_memory() on each batch before returning it. The returned tensor is still on the CPU, and I have to call .cuda(non_blocking=True) manually after that. Therefore, the whole process would be
for x in some_iter:
    yield x.pin_memory().cuda(non_blocking=True)
I compared the performance of this with
for x in some_iter:
    yield x.cuda()
Here is the actual code
a = torch.rand(1024, 655360)

%%time
for i in a:
    i.pin_memory().cuda(non_blocking=True)
# CPU times: user 1.35 s, sys: 55.8 ms, total: 1.41 s
# Wall time: 396 ms

%%time
for i in a:
    i.pin_memory().cuda()
# CPU times: user 1.6 s, sys: 12.2 ms, total: 1.62 s
# Wall time: 404 ms

%%time
for i in a:
    i.cuda(non_blocking=True)
# CPU times: user 855 ms, sys: 3.87 ms, total: 859 ms
# Wall time: 274 ms

%%time
for i in a:
    i.cuda()
# CPU times: user 314 ms, sys: 12 µs, total: 314 ms
# Wall time: 313 ms
As a result, not pinning memory uses less CPU time and is also faster in wall-clock time. Shouldn't pinning memory make data transfer asynchronous and therefore faster? If that's not the case, why would we pin memory at all?
PS. I thought about the possibility of pinning a whole TensorDataset in advance (rather than pinning batches each time). But this cannot pin a tensor that is bigger than GPU memory:
a = np.memmap('../dat/R/train.3,31,31B', '3,31,31B', 'r')
a.nbytes // 2**30
## 68
torch.from_numpy(a).pin_memory()
## ---------------------------------------------------------------------------
## RuntimeError Traceback (most recent call last)
## <ipython-input-36-d6f2d74da8e7> in <module>
## ----> 1 torch.from_numpy(a).pin_memory()
##
## RuntimeError: cuda runtime error (2) : out of memory at /tmp/pip-req-build-58y_cjjl/aten/src/THC/THCCachingHostAllocator.cpp:296
And if I do want to pin a small tensor, why don't I directly move the whole tensor into GPU memory in advance?
TL;DR
Your code is slower because you allocate a new block of pinned memory each time you call the generator. Allocating new pinned memory requires synchronization every time, which makes it much slower than using non-pinned memory. Most likely, you are measuring this overhead.
Your code example in the edit fails in THCCachingHostAllocator.cpp. It's not the GPU running out of memory, but your host refusing to allocate 68 GB of pinned physical memory.
Pinning memory is actually slower in PyTorch?
Creating or releasing pinned memory (cudaHostAlloc()/cudaFreeHost() via the CUDA Runtime) is much slower than malloc/free because it involves synchronization between the devices (GPU and host). Likely, what you are measuring is, to a large extent, this overhead, as you are incrementally allocating pinned memory.
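To make that concrete, here is a minimal sketch (not what DataLoader does internally; the tensor shape matches the question, everything else is made up) that pays the pinned allocation cost once and then reuses the buffer, rather than calling .pin_memory() on every row:

import torch

# Allocate one pinned staging buffer up front instead of pinning each row.
src = torch.rand(1024, 655360)
staging = torch.empty(655360).pin_memory()       # cudaHostAlloc() happens once

for row in src:
    staging.copy_(row)                               # ordinary host-to-host copy
    row_gpu = staging.to('cuda', non_blocking=True)  # async DMA from pinned memory
    torch.cuda.synchronize()                         # don't overwrite the buffer while
                                                     # the DMA is still in flight

Synchronizing inside the loop gives up the overlap, of course; a real pipeline would use more than one buffer (see the buffer sketch further down), but even this avoids the repeated pinned allocations.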
Shouldn't pinning memory make data transfer asynchronous and therefore faster? If that's not the case, why would we pin memory at all?
It can, but not if you halt/join to synchronize before each transfer in order to allocate the memory.
What pinning memory ultimately does is that it prevents the memory block from being swapped out by the OS; it is guaranteed to remain in RAM. This guarantee enables the GPU's DMA to operate on that block without going through the CPU (which has to check, among other things, if the data needs to be swapped back in). Thus, the CPU is free to do other stuff in the meantime.
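As a small illustration of that asynchrony (the shape and the .sum() calls are just placeholders), a non_blocking copy from a pinned tensor returns immediately, and the CPU can keep working while the DMA engine moves the data:

import torch

host = torch.rand(4096, 4096).pin_memory()   # pinned => eligible for async DMA
dev = host.to('cuda', non_blocking=True)     # queues the copy and returns right away

cpu_result = host.sum()                      # CPU work that overlaps the transfer
                                             # (reading the source is safe)

torch.cuda.synchronize()                     # only now is `dev` guaranteed to be filled
print(cpu_result.item(), dev.sum().item())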
It is not a perfect analogy, but you could think about pinned memory as shared memory between the GPU and the host. Both parties can operate on it without informing the other party; a bit like multiple threads in a process. This can be much faster if you implement non-blocking code. However, it can also be much slower if the parties end up joining all the time.
Contrast this to the non-pinned approach, where the CPU loads the data from RAM (swapped in if necessary) and then sends it to the GPU. Not only is it slower (needs to go through the northbridge twice), but it also keeps the thread (and hence one CPU core) busy. Python also has the infamous GIL, so it could be that your entire application is waiting for that synchronous I/O.
If you want to use pinned memory to shuffle batches of data into the GPU, then one way to do it is to use pinned memory as a (circular) buffer. The CPU can load the data from disk, apply preprocessing, and place the batch into the buffer. The GPU can then fetch batches from the buffer in its own time and do the inference. If the implementation is done well, then the GPU will not idle more than necessary, and there is no more need for synchronization between the host and the GPU.
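A rough sketch of that idea follows (all names, shapes, the number of buffers, and the producer's "load + preprocess" stand-in are hypothetical, not a reference implementation): a producer thread fills a small pool of pre-allocated pinned buffers while the main thread ships them to the GPU on its own stream.

import threading
import queue
import torch

BATCH_SHAPE = (256, 1024)   # assumed batch shape, for illustration only
N_BUFFERS = 4               # depth of the "circular" buffer

free_buffers = queue.Queue()
ready_batches = queue.Queue()
for _ in range(N_BUFFERS):
    free_buffers.put(torch.empty(BATCH_SHAPE).pin_memory())  # pinned once, reused forever

def producer(num_batches):
    for _ in range(num_batches):
        buf = free_buffers.get()            # blocks if the GPU side falls behind
        buf.copy_(torch.rand(BATCH_SHAPE))  # stand-in for "load from disk + preprocess"
        ready_batches.put(buf)
    ready_batches.put(None)                 # sentinel: no more data

threading.Thread(target=producer, args=(32,), daemon=True).start()

stream = torch.cuda.Stream()                # dedicated stream for copy + compute
while (buf := ready_batches.get()) is not None:
    with torch.cuda.stream(stream):
        batch = buf.to('cuda', non_blocking=True)  # async DMA from the pinned buffer
        result = batch.mul(2).sum()                # stand-in for the actual inference
    stream.synchronize()                    # buffer is safe to reuse after this point
    free_buffers.put(buf)                   # hand the pinned buffer back to the producer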
And if I do want to pin a small tensor, why don't I directly move the whole tensor into GPU memory in advance?
If you don't need to access the tensor from the CPU and it fits onto the GPU, then there is indeed no need to put it into pinned memory.
In your example, you are opening a memory-mapped numpy array (memmap) and then asking to transfer it to pinned memory. A memory-mapped file works very similarly to paged memory, in that data that no longer fits in RAM is flushed to disk and loaded back in when it is accessed again.
This "swapping" can not happen for pinned memory, because we need to guarantee that the entire block resides in RAM at all dimes. Hence, we need to first load the entire array into host memory - a contiguous block of 68 GB -, likely creating a copy of the array in the process to not destroy the memmap
object, and then we need to pin that memory block, telling the host to forfeit 68GB of managed physical memory to our application. Either of these two steps can be denied by the OS and raise an OutOfMemory
error.
This is pretty much what you are seeing, as the failure is in THCCachingHostAllocator.cpp.
Answer from a PyTorch dev:
"pinned memory is page-locked memory. It is easy for users to shoot themselves in the foot if they enable page-locked memory for everything, because it cant be pre-empted. That is why we did not make it default True" from here
This means that, depending on your current memory situation (amount of RAM, fragmentation, etc.), enabling it for everything may slow down your system.