Pinned memory is supposed to increase transfer rates from host to device (API reference). However, I found that I do not need to call cuMemcpyHtoD for the kernel to access the values, or cuMemcpyDtoH for the host to read the values back. I didn't think this would work, but it does:
#include <cstdio>

__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue + 10;
}

void test()
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    // Allocate page-locked host memory, without the cudaHostAllocMapped flag.
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocDefault);

    // Set memory values.
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;

    // Launch the kernel with the host pointer, without any cudaMemcpy.
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<<numBlocks, threadsPerBlock>>>(pinnedHostPtr);

    // Read the output back directly through the host pointer.
    printf("Data after kernel execution: ");
    for (size_t i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);
    printf("\n");
}
Output:
Data after kernel execution: 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000
Thread id: 0, memory content: 0.000000
Thread id: 1, memory content: 1.000000
Thread id: 2, memory content: 2.000000
Thread id: 3, memory content: 3.000000
Thread id: 4, memory content: 4.000000
Thread id: 5, memory content: 5.000000
Thread id: 6, memory content: 6.000000
Thread id: 7, memory content: 7.000000
My questions are:
- Why can the kernel access the host-allocated pinned memory, and why can the host see the kernel's writes, without any explicit copy calls?
- If this already works with cudaHostAllocDefault, why does the explicit mapped-memory mechanism (cudaHostAllocMapped plus cudaHostGetDevicePointer()) exist?

I'm using CUDA Toolkit 5.5, a Quadro 4000 with the driver set to TCC mode, and the compilation options sm_20,compute_20.
Congratulations! You're encountering a 2.x compute capability + TCC + 64-bit OS feature with newer CUDA versions :)
Read the rest to find out more!
First, a short theory summary, as CUDA taught us:
Pinned memory by itself is not zero-copy: the GPU cannot access it because it is not mapped into the GPU's address space; it is used to transfer data from the host to the GPU efficiently. It is page-locked memory (a scarce kernel resource) and has performance advantages over normal pageable memory.
Pinned zero-copy memory is page-locked memory (usually allocated with the cudaHostAllocMapped flag) that is also mapped into the GPU's address space, so the GPU can access it directly.
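As a sketch of the first flavor, this is roughly how pinned memory is normally used: as a page-locked host staging buffer for fast, explicit copies to and from a separate device allocation. Names and sizes are illustrative, error checking is omitted, and the zero-copy flavor is sketched further below.

#include <cuda_runtime.h>

// Classic use of pinned memory: a page-locked host staging buffer for
// explicit transfers to/from an ordinary device allocation.
void classicPinnedTransfer()
{
    const size_t N = 8;
    const size_t bytes = N * sizeof(double);

    double *pinnedHost = nullptr;   // page-locked host memory
    double *deviceBuf  = nullptr;   // ordinary device memory

    cudaHostAlloc((void **)&pinnedHost, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&deviceBuf, bytes);

    for (size_t i = 0; i < N; ++i)
        pinnedHost[i] = static_cast<double>(i);

    // Explicit transfers: this is what pinned memory normally accelerates.
    cudaMemcpy(deviceBuf, pinnedHost, bytes, cudaMemcpyHostToDevice);
    // ... launch kernels that operate on deviceBuf ...
    cudaMemcpy(pinnedHost, deviceBuf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(deviceBuf);
    cudaFreeHost(pinnedHost);
}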
So why can you access host-allocated memory from the device without explicitly requesting it?
Take a look at the release notes for CUDA 4.0 (and higher):
- (Windows and Linux) Added support for unified virtual address space.
Devices supporting 64-bit and compute 2.0 and higher capability now share a single unified address space between the host and all devices. This means that the pointer used to access memory on the host is the same as the pointer used to access memory on the device. Therefore, the location of memory may be queried directly from its pointer value; the direction of a memory copy need not be specified.
To summarize: if your card has compute capability 2.0 or higher (and it does: https://developer.nvidia.com/cuda-gpus), you are running a 64-bit OS, and on Windows the device is in TCC mode, then you are automatically using UVA (Unified Virtual Addressing) between the host and the device. That means your code automatically gets zero-copy-like accesses.
This is also described in the current version of the CUDA documentation, in the paragraph "Automatic Mapping of Host Allocated Host Memory".
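If you want to see this in practice, here is a minimal sketch, assuming pinnedHostPtr comes from cudaHostAlloc() as in your code: under UVA, cudaPointerGetAttributes() reports matching host and device aliases for the pinned allocation, and cudaMemcpy() with cudaMemcpyDefault lets the runtime deduce the copy direction from the pointer values. The function and buffer names are illustrative and error checking is minimal.

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: with UVA, the runtime can tell where a pointer lives, and
// cudaMemcpyDefault lets it infer the copy direction from the pointers.
void inspectPointer(double *pinnedHostPtr, size_t bytes)
{
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, pinnedHostPtr) == cudaSuccess)
    {
        // Under UVA the host and device aliases of a pinned allocation
        // are the same address.
        printf("host alias:   %p\n", attr.hostPointer);
        printf("device alias: %p\n", attr.devicePointer);
    }

    double *deviceBuf = nullptr;
    cudaMalloc((void **)&deviceBuf, bytes);

    // No cudaMemcpyHostToDevice needed: the direction is deduced under UVA.
    cudaMemcpy(deviceBuf, pinnedHostPtr, bytes, cudaMemcpyDefault);

    cudaFree(deviceBuf);
}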
Mapped memory is a type of pinned memory. It is created when you pin the memory and pass the cudaHostAllocMapped flag. However, even though you specified cudaHostAllocDefault, the memory can also end up "Mapped" under certain circumstances. I believe TCC mode combined with a 64-bit OS is sufficient to meet the conditions required for this "auto-Mapped" behavior.
The central issue is whether UVA is in effect. In your case, it is.
As for why the explicit capability exists at all: it is for cases where UVA is not in effect (for example, on a 32-bit host OS).
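For completeness, here is a sketch of what that explicit path looks like when UVA is not in effect, using cudaSetDeviceFlags(), cudaHostAllocMapped and cudaHostGetDevicePointer(). Kernel and function names are illustrative and error checking is omitted.

#include <cuda_runtime.h>

__global__ void addTen(double *mem) { mem[threadIdx.x] += 10.0; }

// Sketch of the explicit zero-copy path needed without UVA
// (e.g. on a 32-bit host OS).
void explicitMappedPath()
{
    const int THREADS = 8;
    const size_t bytes = THREADS * sizeof(double);

    // Must be called before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    double *hostPtr = nullptr;
    cudaHostAlloc((void **)&hostPtr, bytes, cudaHostAllocMapped);

    // Without UVA the device-side alias differs from the host pointer,
    // so it has to be queried explicitly.
    double *devPtr = nullptr;
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);

    addTen<<<1, THREADS>>>(devPtr);   // pass the device alias, not hostPtr
    cudaDeviceSynchronize();          // make the kernel's writes visible

    cudaFreeHost(hostPtr);
}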
From the documentation (when UVA is in effect):
Automatic Mapping of Host Allocated Host Memory
All host memory allocated through all devices using cudaMallocHost() and cudaHostAlloc() is always directly accessible from all devices that support unified addressing. This is the case regardless of whether or not the flags cudaHostAllocPortable and cudaHostAllocMapped are specified. The pointer value through which allocated host memory may be accessed in kernels on all devices that support unified addressing is the same as the pointer value through which that memory is accessed on the host. It is not necessary to call cudaHostGetDevicePointer() to get the device pointer for these allocations.
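If you would rather check that condition at runtime than assume it, here is a small sketch (the function name is illustrative) that queries the relevant device properties before relying on passing host pointers straight into a kernel.

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: check whether the active device supports unified addressing.
bool supportsUnifiedAddressing()
{
    int dev = 0;
    cudaGetDevice(&dev);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    printf("Device %d (%s): unifiedAddressing=%d, canMapHostMemory=%d\n",
           dev, prop.name, prop.unifiedAddressing, prop.canMapHostMemory);

    return prop.unifiedAddressing != 0;
}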