
CUDA streams not overlapping

I have code very similar to the following:

int k, no_streams = 4;
cudaStream_t stream[no_streams];
for(k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]);

cudaMalloc(&g_in,  size1*no_streams);
cudaMalloc(&g_out, size2*no_streams);

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(g_in+k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);

for (k = 0; k < no_streams; k++)
  mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in+k*size1/sizeof(float), g_out+k*size2/sizeof(float));

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(h_ptr_out[k], g_out+k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);

cudaThreadSynchronize();

cudaFree(g_in);
cudaFree(g_out);

'h_ptr_in' and 'h_ptr_out' are arrays of pointers allocated with cudaMallocHost (with no flags).

The problem is that the streams do not overlap. In the visual profiler I can see the kernel execution from the first stream overlapping with the copy (H2D) from the second stream but nothing else overlaps.

I may not have the resources to run two kernels concurrently (I think I do), but at least the kernel execution and the copies should be overlapping, right? And if I put all three operations (H2D copy, kernel execution, D2H copy) in the same for-loop, none of them overlap...

Please help: what could be causing this?

I'm running on:

Ubuntu 10.04 x64

Device: "GeForce GTX 460" (CUDA Driver Version: 3.20, CUDA Runtime Version: 3.20, CUDA Capability Major/Minor version number: 2.1, Concurrent copy and execution: Yes, Concurrent kernel execution: Yes)

pmcr asked May 20 '11 10:05


2 Answers

According to this post on the NVIDIA forums, the profiler serializes stream execution in order to get accurate timing data. If you think your timings are off, make sure you're measuring with CUDA events.
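To make the CUDA events suggestion concrete, here is a minimal sketch that times the same copy/kernel/copy work once on a single stream and once spread over four streams, outside the profiler. It keeps the three-loop issue order from the question. The names busy_kernel, time_pipeline_ms and check, and the chunk size, are hypothetical stand-ins rather than anything from the original code, and the sketch assumes the events are recorded in the legacy default stream so that they order themselves against the other streams.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical stand-in for mykernel: a little arithmetic per element.
__global__ void busy_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];
        for (int r = 0; r < 256; ++r) v = v * 0.999f + 1.0f;
        out[i] = v;
    }
}

// Issues kChunks copy/kernel/copy chunks round-robin over n_streams streams
// and returns the elapsed time in milliseconds as measured by CUDA events.
float time_pipeline_ms(int n_streams) {
    const int kChunks = 4;
    const int n = 1 << 20;                 // floats per chunk (assumed size)
    const size_t bytes = n * sizeof(float);
    if (n_streams < 1) n_streams = 1;
    if (n_streams > kChunks) n_streams = kChunks;

    float *h_in, *h_out, *d_in, *d_out;
    check(cudaMallocHost((void **)&h_in, bytes * kChunks), "cudaMallocHost(h_in)");
    check(cudaMallocHost((void **)&h_out, bytes * kChunks), "cudaMallocHost(h_out)");
    check(cudaMalloc((void **)&d_in, bytes * kChunks), "cudaMalloc(d_in)");
    check(cudaMalloc((void **)&d_out, bytes * kChunks), "cudaMalloc(d_out)");
    for (int i = 0; i < n * kChunks; ++i) h_in[i] = 1.0f;

    cudaStream_t streams[kChunks];
    for (int s = 0; s < n_streams; ++s)
        check(cudaStreamCreate(&streams[s]), "cudaStreamCreate");

    cudaEvent_t start, stop;
    check(cudaEventCreate(&start), "cudaEventCreate(start)");
    check(cudaEventCreate(&stop), "cudaEventCreate(stop)");

    // Assumption: events recorded in the legacy default stream are ordered
    // against all work in the blocking streams created above.
    check(cudaEventRecord(start, cudaStreamLegacy), "cudaEventRecord(start)");
    for (int k = 0; k < kChunks; ++k)      // all H2D copies first, as in the question
        check(cudaMemcpyAsync(d_in + k * n, h_in + k * n, bytes,
                              cudaMemcpyHostToDevice, streams[k % n_streams]), "H2D copy");
    for (int k = 0; k < kChunks; ++k) {    // then all kernels
        busy_kernel<<<(n + 255) / 256, 256, 0, streams[k % n_streams]>>>(
            d_in + k * n, d_out + k * n, n);
        check(cudaGetLastError(), "kernel launch");
    }
    for (int k = 0; k < kChunks; ++k)      // then all D2H copies
        check(cudaMemcpyAsync(h_out + k * n, d_out + k * n, bytes,
                              cudaMemcpyDeviceToHost, streams[k % n_streams]), "D2H copy");
    check(cudaEventRecord(stop, cudaStreamLegacy), "cudaEventRecord(stop)");
    check(cudaEventSynchronize(stop), "cudaEventSynchronize");

    float ms = 0.0f;
    check(cudaEventElapsedTime(&ms, start, stop), "cudaEventElapsedTime");

    // Release everything so the function can be called repeatedly.
    check(cudaEventDestroy(start), "cudaEventDestroy(start)");
    check(cudaEventDestroy(stop), "cudaEventDestroy(stop)");
    for (int s = 0; s < n_streams; ++s)
        check(cudaStreamDestroy(streams[s]), "cudaStreamDestroy");
    check(cudaFree(d_in), "cudaFree(d_in)");
    check(cudaFree(d_out), "cudaFree(d_out)");
    check(cudaFreeHost(h_in), "cudaFreeHost(h_in)");
    check(cudaFreeHost(h_out), "cudaFreeHost(h_out)");
    return ms;
}

int main() {
    time_pipeline_ms(1);                   // warm-up run, result discarded
    printf("1 stream : %.3f ms\n", time_pipeline_ms(1));
    printf("4 streams: %.3f ms\n", time_pipeline_ms(4));
    return 0;
}
```

If the four-stream time is clearly lower than the one-stream time, some overlap is taking place even though the profiler timeline shows none. If the two times are about the same, the work really is being serialized.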

I've been experimenting with streaming lately, and I found the "simpleMultiCopy" example from the SDK to be really helpful, particularly with the appropriate logic and synchronizations.
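As a rough illustration of the kind of per-stream synchronization mentioned above, the sketch below waits on each stream separately with cudaStreamSynchronize instead of using one global synchronization. The host can then check chunk k while later streams may still be busy. This is not the SDK sample's code. The names add_one, process_chunks and check, and the buffer sizes, are hypothetical and chosen only for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical stand-in for mykernel: out[i] = in[i] + 1.
__global__ void add_one(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// Runs one chunk per stream and returns how many chunks came back correct.
int process_chunks() {
    const int kStreams = 4;
    const int n = 1 << 18;                 // floats per chunk (assumed size)
    const size_t bytes = n * sizeof(float);

    float *h_in[kStreams], *h_out[kStreams], *d_in, *d_out;
    cudaStream_t stream[kStreams];
    check(cudaMalloc((void **)&d_in, bytes * kStreams), "cudaMalloc(d_in)");
    check(cudaMalloc((void **)&d_out, bytes * kStreams), "cudaMalloc(d_out)");
    for (int k = 0; k < kStreams; ++k) {
        // Pinned host buffers, as in the question, so async copies are possible.
        check(cudaMallocHost((void **)&h_in[k], bytes), "cudaMallocHost(h_in)");
        check(cudaMallocHost((void **)&h_out[k], bytes), "cudaMallocHost(h_out)");
        check(cudaStreamCreate(&stream[k]), "cudaStreamCreate");
        for (int i = 0; i < n; ++i) h_in[k][i] = (float)(i % 1024 + k);
    }

    // Same issue order as the question: all H2D copies, all kernels, all D2H copies.
    for (int k = 0; k < kStreams; ++k)
        check(cudaMemcpyAsync(d_in + k * n, h_in[k], bytes,
                              cudaMemcpyHostToDevice, stream[k]), "H2D copy");
    for (int k = 0; k < kStreams; ++k) {
        add_one<<<(n + 255) / 256, 256, 0, stream[k]>>>(d_in + k * n, d_out + k * n, n);
        check(cudaGetLastError(), "kernel launch");
    }
    for (int k = 0; k < kStreams; ++k)
        check(cudaMemcpyAsync(h_out[k], d_out + k * n, bytes,
                              cudaMemcpyDeviceToHost, stream[k]), "D2H copy");

    int ok_chunks = 0;
    for (int k = 0; k < kStreams; ++k) {
        // Blocks only until stream k has drained; other streams are not waited on.
        check(cudaStreamSynchronize(stream[k]), "cudaStreamSynchronize");
        bool ok = true;
        for (int i = 0; i < n; ++i)
            if (h_out[k][i] != h_in[k][i] + 1.0f) { ok = false; break; }
        if (ok) ++ok_chunks;
    }

    for (int k = 0; k < kStreams; ++k) {
        check(cudaStreamDestroy(stream[k]), "cudaStreamDestroy");
        check(cudaFreeHost(h_in[k]), "cudaFreeHost(h_in)");
        check(cudaFreeHost(h_out[k]), "cudaFreeHost(h_out)");
    }
    check(cudaFree(d_in), "cudaFree(d_in)");
    check(cudaFree(d_out), "cudaFree(d_out)");
    return ok_chunks;
}

int main() {
    printf("%d chunks verified\n", process_chunks());
    return 0;
}
```

Waiting per stream keeps the host-side check of one chunk from holding up work queued in the other streams, which a single device-wide synchronization cannot do.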

tpm1510 answered Sep 19 '22 10:09


If you want to see kernels overlapping with other kernels (concurrent kernels), you need the Visual Profiler that ships with the CUDA 5.0 Toolkit. I don't think earlier profilers are capable of showing this. It should also show kernel and memcpy overlap.

shadow answered Sep 20 '22 10:09