Will cudaMalloc synchronize host and device?



I understand that cudaMemcpy will synchronize host and device, but how about cudaMalloc or cudaFree?

Basically, I want to overlap memory allocation/copy and kernel execution across multiple GPU devices. A simplified version of my code looks like this:

void wrapper_kernel(const int &ngpu, const float * const &data)
{
    // some host code
}

int main()
{
    const int NGPU = 3;
    static float *data[NGPU];
    for (int i = 0; i < NGPU; i++) wrapper_kernel(i, data[i]);
    // some host code
}

However, the GPUs run sequentially, and I can't figure out why.

2 Answers

Try using a separate cudaStream_t for each GPU. Below is an excerpt from simpleMultiGPU.cu, taken from the CUDA samples.

//Solver config
TGPUplan      plan[MAX_GPU_COUNT];
//GPU reduction results
float     h_SumGPU[MAX_GPU_COUNT];

....memory init....

//Create streams for issuing GPU command asynchronously and allocate memory (GPU and System page-locked)
for (i = 0; i < GPU_N; i++)
{
    checkCudaErrors(cudaSetDevice(i));
    checkCudaErrors(cudaStreamCreate(&plan[i].stream));
    //Allocate memory
    checkCudaErrors(cudaMalloc((void **)&plan[i].d_Data, plan[i].dataN * sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&plan[i].d_Sum, ACCUM_N * sizeof(float)));
    checkCudaErrors(cudaMallocHost((void **)&plan[i].h_Sum_from_device, ACCUM_N * sizeof(float)));
    checkCudaErrors(cudaMallocHost((void **)&plan[i].h_Data, plan[i].dataN * sizeof(float)));

    for (j = 0; j < plan[i].dataN; j++)
    {
        plan[i].h_Data[j] = (float)rand() / (float)RAND_MAX;
    }
}

....kernel, memory copyback....

Here is a guide on using multiple GPUs.
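
To make the stream-per-GPU idea concrete, here is a minimal sketch (not the sample's actual code). The kernel scale_by_two, the GpuWork struct, the CUDA_TRY macro and the function run_on_all_gpus are hypothetical names made up for illustration. It assumes the work splits into independent per-device chunks: everything is allocated first, then copies and kernels are enqueued on every device before the host waits on any of them, and nothing is freed until all devices have finished.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: leave the enclosing function on the first CUDA error.
// (Sketch only: anything already allocated is leaked on this path.)
#define CUDA_TRY(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical kernel: doubles every element in place.
__global__ void scale_by_two(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Hypothetical per-GPU state, loosely modeled on the sample's TGPUplan.
struct GpuWork {
    cudaStream_t stream;
    float *h_data;  // pinned host buffer
    float *d_data;  // device buffer
};

// Runs scale_by_two on every visible GPU; returns true if all results are correct.
bool run_on_all_gpus(int n)
{
    int ngpu = 0;
    CUDA_TRY(cudaGetDeviceCount(&ngpu));
    if (ngpu == 0) return false;
    std::vector<GpuWork> work(ngpu);
    const size_t bytes = (size_t)n * sizeof(float);

    // Phase 1: create streams and allocate everything up front.
    for (int g = 0; g < ngpu; g++) {
        CUDA_TRY(cudaSetDevice(g));
        CUDA_TRY(cudaStreamCreate(&work[g].stream));
        CUDA_TRY(cudaMalloc((void **)&work[g].d_data, bytes));
        CUDA_TRY(cudaMallocHost((void **)&work[g].h_data, bytes));
        for (int j = 0; j < n; j++) work[g].h_data[j] = 1.0f;
    }

    // Phase 2: enqueue copy-in, kernel and copy-out on each device's stream.
    // These calls return without waiting for the GPU to finish.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int g = 0; g < ngpu; g++) {
        CUDA_TRY(cudaSetDevice(g));
        CUDA_TRY(cudaMemcpyAsync(work[g].d_data, work[g].h_data, bytes,
                                 cudaMemcpyHostToDevice, work[g].stream));
        scale_by_two<<<blocks, threads, 0, work[g].stream>>>(work[g].d_data, n);
        CUDA_TRY(cudaGetLastError());
        CUDA_TRY(cudaMemcpyAsync(work[g].h_data, work[g].d_data, bytes,
                                 cudaMemcpyDeviceToHost, work[g].stream));
    }

    // Phase 3: only now wait for each device and check its results.
    bool ok = true;
    for (int g = 0; g < ngpu; g++) {
        CUDA_TRY(cudaSetDevice(g));
        CUDA_TRY(cudaStreamSynchronize(work[g].stream));
        for (int j = 0; j < n; j++)
            if (work[g].h_data[j] != 2.0f) ok = false;
    }

    // Phase 4: release resources after every device is done.
    for (int g = 0; g < ngpu; g++) {
        CUDA_TRY(cudaSetDevice(g));
        CUDA_TRY(cudaFreeHost(work[g].h_data));
        CUDA_TRY(cudaFree(work[g].d_data));
        CUDA_TRY(cudaStreamDestroy(work[g].stream));
    }
    return ok;
}
```

Calling run_on_all_gpus(1 << 20) from main would exercise every visible device. The point of the four separate loops is that no device is waited on before all the others have been given their work.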

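The most likely reason your GPU operations run sequentially is cudaFree. cudaMalloc generally does not wait for previously issued GPU work to finish, but cudaFree implicitly synchronizes with the device, so the host blocks there. To work around this, queue the pointers on the CPU thread and send the free requests at the end of a series of operations.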
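
The "queue the frees and send them at the end" idea could look roughly like the sketch below. DeferredFreeList and its methods are hypothetical names, not part of CUDA; the only runtime calls used are cudaGetDevice, cudaSetDevice and cudaFree. It assumes the caller records which device each pointer was allocated on.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: remembers device pointers and frees them in one batch,
// so the synchronizing cudaFree calls happen after all work has been issued.
class DeferredFreeList {
    struct Entry { int device; void *ptr; };
    std::vector<Entry> pending_;

public:
    // Record a pointer (and the device that owns it) instead of freeing it now.
    void defer(int device, void *ptr) { pending_.push_back(Entry{device, ptr}); }

    // Number of pointers waiting to be freed.
    std::size_t pending() const { return pending_.size(); }

    // Free every recorded pointer; returns the first error seen, if any.
    cudaError_t release_all()
    {
        int previous = 0;
        bool restore = (cudaGetDevice(&previous) == cudaSuccess);

        cudaError_t first = cudaSuccess;
        for (const Entry &e : pending_) {
            cudaError_t err = cudaSetDevice(e.device);
            if (err == cudaSuccess) err = cudaFree(e.ptr);
            if (first == cudaSuccess) first = err;
        }
        pending_.clear();

        // Put the caller's current device back the way it was.
        if (restore) cudaSetDevice(previous);
        return first;
    }
};
```

In the question's code, wrapper_kernel would call defer(ngpu, ptr) where it would otherwise call cudaFree, and main would call release_all() once, after the loop over GPUs.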

You also need pinned (page-locked) memory for asynchronous transfers from CPU memory to GPU memory. CUDA provides two utility functions for this, cudaMallocHost and cudaFreeHost, to use in place of malloc and free. There is also cudaHostAlloc for finer control; see the CUDA documentation for details.
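
As a small illustration of the pinned-memory point, the sketch below sends a buffer to the GPU and back with cudaMemcpyAsync. The function name pinned_round_trip is hypothetical. One host buffer comes from cudaMallocHost and the other from cudaHostAlloc with the cudaHostAllocDefault flag, purely to show both calls side by side.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: copy n floats host -> device -> host through pinned
// buffers on a stream. Returns true if the data survives the round trip.
bool pinned_round_trip(int n)
{
    const size_t bytes = (size_t)n * sizeof(float);
    float *h_src = nullptr, *h_dst = nullptr, *d_buf = nullptr;

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;

    // Pinned host allocations replace malloc; both buffers are freed with cudaFreeHost.
    bool ok = cudaMallocHost((void **)&h_src, bytes) == cudaSuccess
           && cudaHostAlloc((void **)&h_dst, bytes, cudaHostAllocDefault) == cudaSuccess
           && cudaMalloc((void **)&d_buf, bytes) == cudaSuccess;

    if (ok) {
        for (int i = 0; i < n; i++) h_src[i] = (float)i;

        // Both copies are only enqueued here; the host could do other work
        // between these calls and the synchronize below.
        ok = cudaMemcpyAsync(d_buf, h_src, bytes, cudaMemcpyHostToDevice, stream) == cudaSuccess
          && cudaMemcpyAsync(h_dst, d_buf, bytes, cudaMemcpyDeviceToHost, stream) == cudaSuccess
          && cudaStreamSynchronize(stream) == cudaSuccess;
    }

    if (ok) {
        for (int i = 0; i < n; i++)
            if (h_dst[i] != (float)i) ok = false;
    }

    // Clean up whatever was successfully allocated.
    if (d_buf) cudaFree(d_buf);
    if (h_dst) cudaFreeHost(h_dst);
    if (h_src) cudaFreeHost(h_src);
    cudaStreamDestroy(stream);
    return ok;
}
```

With ordinary malloc'd buffers the same cudaMemcpyAsync calls would still work, but they would not be able to overlap with host work in the same way, which is why the pinned allocations matter here.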

