Effect of using page-able memory for asynchronous memory copy?

In CUDA C Best Practices Guide Version 5.0, Section 6.1.2, it is written that:

In contrast with cudaMemcpy(), the asynchronous transfer version requires pinned host memory (see Pinned Memory), and it contains an additional argument, a stream ID.

This implies that cudaMemcpyAsync should fail if I pass it ordinary pageable host memory.

But this is not what happened.

Just for testing purposes, I tried the following program:

Kernel:

#include <cstdio>
#include <iostream>

__global__ void kernel_increment(float* src, float* dst, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if(tid<n)   
        dst[tid] = src[tid] + 1.0f;
}

Main:

int main()
{
    float *hPtr1, *hPtr2, *dPtr1, *dPtr2;

    const int n = 1000;

    size_t bytes = n * sizeof(float);

    cudaStream_t str1, str2;

    hPtr1 = new float[n];
    hPtr2 = new float[n];

    for(int i=0; i<n; i++)
        hPtr1[i] = static_cast<float>(i);

    cudaMalloc<float>(&dPtr1,bytes);
    cudaMalloc<float>(&dPtr2,bytes);

    dim3 block(16);
    dim3 grid((n + block.x - 1)/block.x);

    cudaStreamCreate(&str1);
    cudaStreamCreate(&str2);

    cudaMemcpyAsync(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice,str1);
    kernel_increment<<<grid,block,0,str2>>>(dPtr1,dPtr2,n);
    cudaMemcpyAsync(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost,str1);

    printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));

    cudaDeviceSynchronize();

    printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(str1);
    cudaStreamDestroy(str2);

    cudaFree(dPtr1);
    cudaFree(dPtr2);

    for(int i=0; i<n; i++)
        std::cout<<hPtr2[i]<<std::endl;

    delete[] hPtr1;
    delete[] hPtr2;

    return 0;
}

The program gave correct output. The array incremented successfully.

How did cudaMemcpyAsync execute successfully without page-locked host memory? Am I missing something here?

asked Dec 30 '12 by sgarizvi


1 Answer

cudaMemcpyAsync is fundamentally an asynchronous version of cudaMemcpy. This means that it doesn't block the calling host thread when the copy call is issued. That is the basic behaviour of the call.

Optionally, if the call is launched into a non-default stream, the host memory is a pinned allocation, and the device has a free DMA copy engine, the copy operation can happen while the GPU simultaneously performs another operation: either kernel execution or another copy (in the case of a GPU with two DMA copy engines). If any of these conditions is not satisfied, the operation on the GPU is functionally identical to a standard cudaMemcpy call, i.e. it serialises operations on the GPU, and no copy can overlap with kernel execution or with another copy. The only difference is that the operation doesn't block the calling host thread.
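To illustrate, here is a minimal sketch of a transfer that is eligible for overlap: the host buffer comes from cudaMallocHost (pinned), and the copy and an independent kernel go into different streams. The kernel, buffer, and stream names are illustrative, and a device with a free copy engine is assumed:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Independent work that touches a different buffer than the copy.
__global__ void busy_kernel(float* d, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) d[tid] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *hPinned, *dBuf1, *dBuf2;
    cudaMallocHost(&hPinned, bytes);   // pinned host memory, not new[]/malloc
    cudaMalloc(&dBuf1, bytes);
    cudaMalloc(&dBuf2, bytes);

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // Because hPinned is page-locked, this H2D copy in copyStream can run
    // concurrently with the independent kernel issued into computeStream.
    cudaMemcpyAsync(dBuf1, hPinned, bytes, cudaMemcpyHostToDevice, copyStream);
    busy_kernel<<<(n + 255) / 256, 256, 0, computeStream>>>(dBuf2, n);

    cudaDeviceSynchronize();
    printf("Status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
    cudaFreeHost(hPinned);             // pinned memory is freed with cudaFreeHost
    cudaFree(dBuf1);
    cudaFree(dBuf2);
    return 0;
}
```

Whether the copy and the kernel actually overlap can be confirmed with a profiler such as Nsight Systems; functionally the program behaves the same either way.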

In your example code, the host source and destination buffers are not pinned. So the memory transfers cannot overlap with kernel execution (i.e. they serialise operations on the GPU). The calls are still asynchronous on the host. So what you have is functionally equivalent to:

cudaMemcpy(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice);
kernel_increment<<<grid,block>>>(dPtr1,dPtr2,n);
cudaMemcpy(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost);

with the exception that all the calls are asynchronous on the host, so the host thread blocks at the cudaDeviceSynchronize() call rather than at each of the memory transfer calls.
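If changing the allocation is inconvenient, an existing pageable buffer can also be page-locked after the fact with cudaHostRegister, which makes subsequent cudaMemcpyAsync calls on it eligible for true asynchronous transfer. A hedged sketch, reusing the variable names from the question:

```cuda
#include <cuda_runtime.h>

int main()
{
    const int n = 1000;
    const size_t bytes = n * sizeof(float);

    // Ordinary pageable allocation, as in the question.
    float* hPtr1 = new float[n];

    // Pin the pages of the existing allocation in place.
    cudaHostRegister(hPtr1, bytes, cudaHostRegisterDefault);

    // ... issue cudaMemcpyAsync(...) calls involving hPtr1 as before ...

    // Un-pin before releasing the memory.
    cudaHostUnregister(hPtr1);
    delete[] hPtr1;
    return 0;
}
```

Note that both pinning routes (cudaMallocHost and cudaHostRegister) consume physical RAM that the OS can no longer page out, so pinned allocations should be kept to the buffers that actually participate in transfers.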

This is absolutely expected behaviour.

answered Nov 11 '22 by talonmies