I understand that cudaMemcpy will synchronize host and device, but how about cudaMalloc or cudaFree?
Basically, I want memory allocations/copies and kernel executions on multiple GPU devices to run asynchronously; a simplified version of my code looks like this:
void wrapper_kernel(const int &ngpu, const float * const &data)
{
    cudaSetDevice(ngpu);
    cudaMalloc(...);
    cudaMemcpyAsync(...);
    kernels<<<...>>>(...);
    cudaMemcpyAsync(...);
    // some host code
}

int main()
{
    const int NGPU=3;
    static float *data[NGPU];
    for (int i=0; i<NGPU; i++) wrapper_kernel(i,data[i]);
    cudaDeviceSynchronize();
    // some host code
}
However, the GPUs run sequentially, and I can't figure out why.
You need to allocate memory on both the host and the device before you can transfer data between them. Host memory is allocated with malloc; device memory is allocated with the API function cudaMalloc().
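For example, a minimal allocate-and-copy sequence might look like the sketch below (N is assumed to be defined, and error checking is omitted for brevity):

float *h_data = (float *)malloc(N * sizeof(float));        // host memory
float *d_data = NULL;
cudaMalloc((void **)&d_data, N * sizeof(float));           // device memory
// ... fill h_data ...
cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
// ... launch kernels that use d_data ...
cudaFree(d_data);
free(h_data);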
Most CUDA calls are synchronous (often called “blocking”). An example of a blocking call is cudaMemcpy().
Before we can use CUDA streams, we need to understand the notion of device synchronization. This is an operation where the host blocks any further execution until all operations issued to the GPU (memory transfers and kernel executions) have completed.
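In code, the pattern looks something like this (the kernel, launch configuration and buffers are placeholders):

kernel<<<grid, block>>>(d_out);   // kernel launch is asynchronous: control returns to the host immediately
// ... other host work can run here while the GPU computes ...
cudaDeviceSynchronize();          // host blocks until all work issued to the GPU has completed
cudaMemcpy(h_out, d_out, nbytes, cudaMemcpyDeviceToHost);   // safe: the kernel has finished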
cudaFree() is synchronous. If you really want it to be asynchronous, you can create your own CPU thread, give it a worker queue, and register cudaFree requests from your primary thread.
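A simpler variant of that idea, without a dedicated worker thread, is to record the device pointers during the time-critical section and issue the frees afterwards; a rough sketch (d_buffer stands for any previously cudaMalloc'ed pointer):

#include <vector>

std::vector<void *> deferred;                 // device pointers whose cudaFree we postpone

// inside the time-critical section: record instead of freeing
deferred.push_back(d_buffer);

// later, after all asynchronous work has been issued and completed
for (void *p : deferred) cudaFree(p);         // each call may synchronize the device
deferred.clear();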
Try using a cudaStream_t for each GPU. Below is an excerpt from simpleMultiGPU.cu, taken from the CUDA samples.
//Solver config
TGPUplan plan[MAX_GPU_COUNT];
//GPU reduction results
float h_SumGPU[MAX_GPU_COUNT];
....memory init....
//Create streams for issuing GPU command asynchronously and allocate memory (GPU and System page-locked)
for (i = 0; i < GPU_N; i++)
{
    checkCudaErrors(cudaSetDevice(i));
    checkCudaErrors(cudaStreamCreate(&plan[i].stream));

    //Allocate memory
    checkCudaErrors(cudaMalloc((void **)&plan[i].d_Data, plan[i].dataN * sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&plan[i].d_Sum, ACCUM_N * sizeof(float)));
    checkCudaErrors(cudaMallocHost((void **)&plan[i].h_Sum_from_device, ACCUM_N * sizeof(float)));
    checkCudaErrors(cudaMallocHost((void **)&plan[i].h_Data, plan[i].dataN * sizeof(float)));

    for (j = 0; j < plan[i].dataN; j++)
    {
        plan[i].h_Data[j] = (float)rand() / (float)RAND_MAX;
    }
}
....kernel, memory copyback....
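The elided part issues the host-to-device copy, the kernel, and the device-to-host copy on each GPU's own stream without waiting in between. A rough sketch of that pattern, reusing the names from the snippet above (the kernel name and launch configuration here are assumptions, not the sample's exact code):

for (i = 0; i < GPU_N; i++)
{
    checkCudaErrors(cudaSetDevice(i));

    //Copy input data to this GPU on its own stream (returns immediately because h_Data is pinned)
    checkCudaErrors(cudaMemcpyAsync(plan[i].d_Data, plan[i].h_Data,
                                    plan[i].dataN * sizeof(float),
                                    cudaMemcpyHostToDevice, plan[i].stream));

    //Launch the per-GPU kernel on the same stream
    reduceKernel<<<BLOCK_N, THREAD_N, 0, plan[i].stream>>>(plan[i].d_Sum, plan[i].d_Data, plan[i].dataN);

    //Read the partial results back, still asynchronously
    checkCudaErrors(cudaMemcpyAsync(plan[i].h_Sum_from_device, plan[i].d_Sum,
                                    ACCUM_N * sizeof(float),
                                    cudaMemcpyDeviceToHost, plan[i].stream));
}

//Only now wait for every GPU to finish
for (i = 0; i < GPU_N; i++)
{
    checkCudaErrors(cudaSetDevice(i));
    checkCudaErrors(cudaStreamSynchronize(plan[i].stream));
}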
And here is a guide to using multiple GPUs.
The most likely reason you are seeing the GPU operations run sequentially is that cudaMalloc is synchronous (it blocks the calling host thread), and so is cudaFree (you can queue the pointers on the CPU thread and send the free requests at the end of a series of operations).
You need to use pinned memory for asynchronous transfers between CPU memory and GPU memory. CUDA provides two utility functions for this, cudaMallocHost and cudaFreeHost (instead of malloc and free). By the way, there is also cudaHostAlloc for finer control; read the CUDA manual for details.
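A minimal sketch of the difference (nbytes, d_buf and stream are placeholders):

float *h_buf = NULL;
cudaMallocHost((void **)&h_buf, nbytes);      // page-locked (pinned) host memory, not malloc()
// ... fill h_buf ...
cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, stream);   // can overlap with other work
// (with pageable malloc() memory, the copy is not guaranteed to overlap with anything)
cudaStreamSynchronize(stream);                // make sure the copy has finished before releasing the buffer
cudaFreeHost(h_buf);                          // matching release call, not free()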