CUDA Device To Device transfer expensive

Tags: c++, cuda, fft

I have written some code to swap the quadrants of a 2D matrix for FFT purposes. The matrix is stored in a flat array.

    int leftover = W - dcW;

    T *temp;
    T *topHalf;
    cudaMalloc((void **)&temp, dcW * sizeof(T));

    // Swap the left and right parts of every row (three copies per row)
    for (int i = 0; i < H; i++)
    {
        cudaMemcpy(temp, &data[i*W], dcW*sizeof(T), cudaMemcpyDeviceToDevice);
        cudaMemcpy(&data[i*W], &data[i*W+dcW], leftover*sizeof(T), cudaMemcpyDeviceToDevice);
        cudaMemcpy(&data[i*W+leftover], temp, dcW*sizeof(T), cudaMemcpyDeviceToDevice);
    }

    // Swap the top and bottom parts of the whole matrix
    cudaMalloc((void **)&topHalf, dcH * W * sizeof(T));
    leftover = H - dcH;
    cudaMemcpy(topHalf, data, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(data, &data[dcH*W], leftover*W*sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(&data[leftover*W], topHalf, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);

Note that this code takes device pointers and does DeviceToDevice transfers.

Why does this run so slowly? Can it be optimized somehow? I timed it against the same operation on the host using regular memcpy, and the device version was about 2x slower.

Any ideas?

Derek asked May 19 '11 19:05


1 Answer

I ended up writing a kernel to do the swaps. This was indeed faster than the device-to-device memcpy operations.
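
The kernel itself is not shown in this answer, so the following is only a minimal sketch of what such a kernel might look like, not the code that was actually used. It assumes an out-of-place version (separate `in` and `out` device buffers), and the names `shiftedSource`, `shiftQuadrants` and `launchShiftQuadrants` are hypothetical. The loop in the question issues three small `cudaMemcpy` calls per row, so the per-call overhead is likely what dominates; a single kernel launch moves every element in one pass instead.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE  // lets the index helper compile as plain C++ as well
#endif

// Flat source index for output element (row, col).
// The row swaps and the top/bottom swap in the question amount to a
// circular shift by dcW columns and dcH rows.
HOST_DEVICE inline std::size_t shiftedSource(int row, int col,
                                             int W, int H, int dcW, int dcH)
{
    int srcRow = (row + dcH) % H;
    int srcCol = (col + dcW) % W;
    return static_cast<std::size_t>(srcRow) * W + srcCol;
}

#ifdef __CUDACC__
// One thread per output element. Assumption: `in` and `out` are
// different device buffers (this sketch does not work in place).
template <typename T>
__global__ void shiftQuadrants(const T *in, T *out,
                               int W, int H, int dcW, int dcH)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < W && row < H)
    {
        out[static_cast<std::size_t>(row) * W + col] =
            in[shiftedSource(row, col, W, H, dcW, dcH)];
    }
}

// Hypothetical launcher; the 32x8 block shape is an arbitrary choice.
template <typename T>
void launchShiftQuadrants(const T *in, T *out,
                          int W, int H, int dcW, int dcH)
{
    dim3 block(32, 8);
    dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
    shiftQuadrants<<<grid, block>>>(in, out, W, H, dcW, dcH);
}
#endif
```

Threads that are adjacent in `x` write adjacent output elements and read mostly adjacent input elements, so the global memory accesses should coalesce well. An in-place variant is possible too, but it is only simple when `W` and `H` are even and `dcW = W/2`, `dcH = H/2`, because then the shift reduces to pairwise swaps.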

Derek answered Sep 18 '22 20:09