Pinned memory is supposed to increase transfer rates from host to device (API reference). However, I found that I do not need to call cuMemcpyHtoD for the kernel to access the values, or cuMemcpyDtoH for the host to read the values back. I didn't think this would work, but it does:
#include <cstdio>

__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue + 10;
}

void test()
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    // Allocate page-locked host memory, without the cudaHostAllocMapped flag.
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocDefault);

    // Set memory values.
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;

    // Launch the kernel with the host pointer, without any cudaMemcpy.
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<<numBlocks, threadsPerBlock>>>(pinnedHostPtr);

    // Read the output back directly through the host pointer.
    printf("Data after kernel execution: ");
    for (size_t i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);
    printf("\n");
}
Output:
Data after kernel execution: 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000
Thread id: 0, memory content: 0.000000
Thread id: 1, memory content: 1.000000
Thread id: 2, memory content: 2.000000
Thread id: 3, memory content: 3.000000
Thread id: 4, memory content: 4.000000
Thread id: 5, memory content: 5.000000
Thread id: 6, memory content: 6.000000
Thread id: 7, memory content: 7.000000
My questions are:
- Why can the kernel access the host-allocated pinned memory, and why can the host see the kernel's writes, without any explicit copy calls?
- If this already works with cudaHostAllocDefault, why does the explicit mapped-memory mechanism (cudaHostAllocMapped plus cudaHostGetDevicePointer()) exist?

I'm using CUDA Toolkit 5.5, a Quadro 4000 with the driver set to TCC mode, and the compilation options sm_20,compute_20.
Congratulations! You're encountering a 2.x compute capability + TCC + 64-bit OS feature with newer CUDA versions :)
Read the rest to find out more!
First, a short theory summary, as CUDA taught us:
Pinned memory by itself is not zero-copy: the GPU cannot access it because it is not mapped into the GPU's address space; it is used to transfer data from the host to the GPU efficiently. It is page-locked memory (a scarce kernel resource) and has performance advantages over normal pageable memory.
Pinned zero-copy memory is page-locked memory (usually allocated with the cudaHostAllocMapped flag) that is also mapped into the GPU's address space, so the GPU can access it directly.
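As a sketch of the first flavor, this is roughly how pinned memory is normally used: as a page-locked host staging buffer for fast, explicit copies to and from a separate device allocation. Names and sizes are illustrative, error checking is omitted, and the zero-copy flavor is sketched further below.

#include <cuda_runtime.h>

// Classic use of pinned memory: a page-locked host staging buffer for
// explicit transfers to/from an ordinary device allocation.
void classicPinnedTransfer()
{
    const size_t N = 8;
    const size_t bytes = N * sizeof(double);

    double *pinnedHost = nullptr;   // page-locked host memory
    double *deviceBuf  = nullptr;   // ordinary device memory

    cudaHostAlloc((void **)&pinnedHost, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&deviceBuf, bytes);

    for (size_t i = 0; i < N; ++i)
        pinnedHost[i] = static_cast<double>(i);

    // Explicit transfers: this is what pinned memory normally accelerates.
    cudaMemcpy(deviceBuf, pinnedHost, bytes, cudaMemcpyHostToDevice);
    // ... launch kernels that operate on deviceBuf ...
    cudaMemcpy(pinnedHost, deviceBuf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(deviceBuf);
    cudaFreeHost(pinnedHost);
}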
So why can you access host-allocated memory from the device without explicitly requesting it?
Take a look at the release notes for CUDA 4.0 (and higher):
- (Windows and Linux) Added support for unified virtual address space.
Devices supporting 64-bit and compute 2.0 and higher capability now share a single unified address space between the host and all devices. This means that the pointer used to access memory on the host is the same as the pointer used to access memory on the device. Therefore, the location of memory may be queried directly from its pointer value; the direction of a memory copy need not be specified.
To summarize: if your card has compute capability 2.0 or higher (and it does: https://developer.nvidia.com/cuda-gpus), you are running a 64-bit OS, and on Windows the device is in TCC mode, then you are automatically using UVA (Unified Virtual Addressing) between the host and the device. That means your code automatically gets zero-copy-like accesses.
This is also described in the current version of the CUDA documentation, in the paragraph "Automatic Mapping of Host Allocated Host Memory".
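If you want to see this in practice, here is a minimal sketch, assuming pinnedHostPtr comes from cudaHostAlloc() as in your code: under UVA, cudaPointerGetAttributes() reports matching host and device aliases for the pinned allocation, and cudaMemcpy() with cudaMemcpyDefault lets the runtime deduce the copy direction from the pointer values. The function and buffer names are illustrative and error checking is minimal.

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: with UVA, the runtime can tell where a pointer lives, and
// cudaMemcpyDefault lets it infer the copy direction from the pointers.
void inspectPointer(double *pinnedHostPtr, size_t bytes)
{
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, pinnedHostPtr) == cudaSuccess)
    {
        // Under UVA the host and device aliases of a pinned allocation
        // are the same address.
        printf("host alias:   %p\n", attr.hostPointer);
        printf("device alias: %p\n", attr.devicePointer);
    }

    double *deviceBuf = nullptr;
    cudaMalloc((void **)&deviceBuf, bytes);

    // No cudaMemcpyHostToDevice needed: the direction is deduced under UVA.
    cudaMemcpy(deviceBuf, pinnedHostPtr, bytes, cudaMemcpyDefault);

    cudaFree(deviceBuf);
}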
Mapped memory is a type of pinned memory. It is created when you pin the memory and pass the cudaHostAllocMapped flag. However, even though you specified cudaHostAllocDefault, the memory can also end up "Mapped" under certain circumstances. I believe TCC mode combined with a 64-bit OS is sufficient to meet the conditions required for this "auto-Mapped" behavior.
The central issue is whether UVA is in effect. In your case, it is.
As for why the explicit capability exists at all: it is for cases where UVA is not in effect (for example, on a 32-bit host OS).
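For completeness, here is a sketch of what that explicit path looks like when UVA is not in effect, using cudaSetDeviceFlags(), cudaHostAllocMapped and cudaHostGetDevicePointer(). Kernel and function names are illustrative and error checking is omitted.

#include <cuda_runtime.h>

__global__ void addTen(double *mem) { mem[threadIdx.x] += 10.0; }

// Sketch of the explicit zero-copy path needed without UVA
// (e.g. on a 32-bit host OS).
void explicitMappedPath()
{
    const int THREADS = 8;
    const size_t bytes = THREADS * sizeof(double);

    // Must be called before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    double *hostPtr = nullptr;
    cudaHostAlloc((void **)&hostPtr, bytes, cudaHostAllocMapped);

    // Without UVA the device-side alias differs from the host pointer,
    // so it has to be queried explicitly.
    double *devPtr = nullptr;
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);

    addTen<<<1, THREADS>>>(devPtr);   // pass the device alias, not hostPtr
    cudaDeviceSynchronize();          // make the kernel's writes visible

    cudaFreeHost(hostPtr);
}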
From the documentation (when UVA is in effect):
Automatic Mapping of Host Allocated Host Memory
All host memory allocated through all devices using cudaMallocHost() and cudaHostAlloc() is always directly accessible from all devices that support unified addressing. This is the case regardless of whether or not the flags cudaHostAllocPortable and cudaHostAllocMapped are specified. The pointer value through which allocated host memory may be accessed in kernels on all devices that support unified addressing is the same as the pointer value through which that memory is accessed on the host. It is not necessary to call cudaHostGetDevicePointer() to get the device pointer for these allocations.
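If you would rather check that condition at runtime than assume it, here is a small sketch (the function name is illustrative) that queries the relevant device properties before relying on passing host pointers straight into a kernel.

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: check whether the active device supports unified addressing.
bool supportsUnifiedAddressing()
{
    int dev = 0;
    cudaGetDevice(&dev);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    printf("Device %d (%s): unifiedAddressing=%d, canMapHostMemory=%d\n",
           dev, prop.name, prop.unifiedAddressing, prop.canMapHostMemory);

    return prop.unifiedAddressing != 0;
}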