Effect of using page-able memory for asynchronous memory copy?

In CUDA C Best Practices Guide Version 5.0, Section 6.1.2, it is written that:

In contrast with cudaMemcpy(), the asynchronous transfer version requires pinned host memory (see Pinned Memory), and it contains an additional argument, a stream ID.

This implies that cudaMemcpyAsync should fail if I pass it ordinary pageable host memory.

But this is not what happened.

Just for testing purposes, I tried the following program:

Kernel:

#include <cstdio>
#include <iostream>

__global__ void kernel_increment(float* src, float* dst, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if(tid<n)   
        dst[tid] = src[tid] + 1.0f;
}

Main:

int main()
{
    float *hPtr1, *hPtr2, *dPtr1, *dPtr2;

    const int n = 1000;

    size_t bytes = n * sizeof(float);

    cudaStream_t str1, str2;

    hPtr1 = new float[n];
    hPtr2 = new float[n];

    for(int i=0; i<n; i++)
        hPtr1[i] = static_cast<float>(i);

    cudaMalloc<float>(&dPtr1,bytes);
    cudaMalloc<float>(&dPtr2,bytes);

    dim3 block(16);
    dim3 grid((n + block.x - 1)/block.x);

    cudaStreamCreate(&str1);
    cudaStreamCreate(&str2);

    cudaMemcpyAsync(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice,str1);
    kernel_increment<<<grid,block,0,str2>>>(dPtr1,dPtr2,n);
    cudaMemcpyAsync(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost,str1);

    printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));

    cudaDeviceSynchronize();

    printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(str1);
    cudaStreamDestroy(str2);

    cudaFree(dPtr1);
    cudaFree(dPtr2);

    for(int i=0; i<n; i++)
        std::cout<<hPtr2[i]<<std::endl;

    delete[] hPtr1;
    delete[] hPtr2;

    return 0;
}

The program gave correct output. The array incremented successfully.

How did cudaMemcpyAsync execute successfully without page-locked host memory? Am I missing something here?

asked Dec 30 '12 by sgarizvi


1 Answer

cudaMemcpyAsync is fundamentally an asynchronous version of cudaMemcpy. This means that it doesn't block the calling host thread when the copy call is issued. That is the basic behaviour of the call.

Optionally, if the call is launched into a non-default stream, the host memory is a pinned allocation, and the device has a free DMA copy engine, the copy operation can happen while the GPU simultaneously performs another operation: either kernel execution or another copy (in the case of a GPU with two DMA copy engines). If any of these conditions is not satisfied, the operation on the GPU is functionally identical to a standard cudaMemcpy call, i.e. it serialises operations on the GPU, and no copy can overlap with kernel execution or with another copy. The only difference is that the operation doesn't block the calling host thread.
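To illustrate, here is a minimal sketch of a transfer that is eligible for overlap: the host buffer comes from cudaMallocHost (pinned), and the copy and an independent kernel go into different streams. The kernel, buffer, and stream names are illustrative, and a device with a free copy engine is assumed:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Independent work that touches a different buffer than the copy.
__global__ void busy_kernel(float* d, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) d[tid] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *hPinned, *dBuf1, *dBuf2;
    cudaMallocHost(&hPinned, bytes);   // pinned host memory, not new[]/malloc
    cudaMalloc(&dBuf1, bytes);
    cudaMalloc(&dBuf2, bytes);

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // Because hPinned is page-locked, this H2D copy in copyStream can run
    // concurrently with the independent kernel issued into computeStream.
    cudaMemcpyAsync(dBuf1, hPinned, bytes, cudaMemcpyHostToDevice, copyStream);
    busy_kernel<<<(n + 255) / 256, 256, 0, computeStream>>>(dBuf2, n);

    cudaDeviceSynchronize();
    printf("Status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
    cudaFreeHost(hPinned);             // pinned memory is freed with cudaFreeHost
    cudaFree(dBuf1);
    cudaFree(dBuf2);
    return 0;
}
```

Whether the copy and the kernel actually overlap can be confirmed with a profiler such as Nsight Systems; functionally the program behaves the same either way.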

In your example code, the host source and destination buffers are not pinned. So the memory transfers cannot overlap with kernel execution (i.e. they serialise operations on the GPU). The calls are still asynchronous on the host. So what you have is functionally equivalent to:

cudaMemcpy(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice);
kernel_increment<<<grid,block>>>(dPtr1,dPtr2,n);
cudaMemcpy(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost);

with the exception that all the calls are asynchronous on the host, so the host thread blocks at the cudaDeviceSynchronize() call rather than at each of the memory transfer calls.
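If changing the allocation is inconvenient, an existing pageable buffer can also be page-locked after the fact with cudaHostRegister, which makes subsequent cudaMemcpyAsync calls on it eligible for true asynchronous transfer. A hedged sketch, reusing the variable names from the question:

```cuda
#include <cuda_runtime.h>

int main()
{
    const int n = 1000;
    const size_t bytes = n * sizeof(float);

    // Ordinary pageable allocation, as in the question.
    float* hPtr1 = new float[n];

    // Pin the pages of the existing allocation in place.
    cudaHostRegister(hPtr1, bytes, cudaHostRegisterDefault);

    // ... issue cudaMemcpyAsync(...) calls involving hPtr1 as before ...

    // Un-pin before releasing the memory.
    cudaHostUnregister(hPtr1);
    delete[] hPtr1;
    return 0;
}
```

Note that both pinning routes (cudaMallocHost and cudaHostRegister) consume physical RAM that the OS can no longer page out, so pinned allocations should be kept to the buffers that actually participate in transfers.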

This is absolutely expected behaviour.

answered Nov 11 '22 by talonmies