I am attempting to write a small demo program that has two cuda streams progressing and, governed by events, waiting for each other. So far this program looks like this:
// event.cu
#include <iostream>
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda.h>
using namespace std;
__global__ void k_A1() { printf("\tHi! I am Kernel A1.\n"); }
__global__ void k_B1() { printf("\tHi! I am Kernel B1.\n"); }
__global__ void k_A2() { printf("\tHi! I am Kernel A2.\n"); }
__global__ void k_B2() { printf("\tHi! I am Kernel B2.\n"); }
int main()
{
cudaStream_t streamA, streamB;
cudaEvent_t halfA, halfB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);
cudaEventCreate(&halfA);
cudaEventCreate(&halfB);
cout << "Here is the plan:" << endl <<
"Stream A: A1, launch 'HalfA', wait for 'HalfB', A2." << endl <<
"Stream B: Wait for 'HalfA', B1, launch 'HalfB', B2." << endl <<
"I would expect: A1,B1, (A2 and B2 running concurrently)." << endl;
k_A1<<<1,1,0,streamA>>>(); // A1!
cudaEventRecord(halfA,streamA); // StreamA triggers halfA!
cudaStreamWaitEvent(streamA,halfB,0); // StreamA waits for halfB.
k_A2<<<1,1,0,streamA>>>(); // A2!
cudaStreamWaitEvent(streamB,halfA,0); // StreamB waits, for halfA.
k_B1<<<1,1,0,streamB>>>(); // B1!
cudaEventRecord(halfB,streamB); // StreamB triggers halfB!
k_B2<<<1,1,0,streamB>>>(); // B2!
cudaEventDestroy(halfB);
cudaEventDestroy(halfA);
cudaStreamDestroy(streamB);
cudaStreamDestroy(streamA);
cout << "All has been started. Synchronize!" << endl;
cudaDeviceSynchronize();
return 0;
}
My grasp of CUDA streams is the following: A stream is a kind of list to which I can add tasks. These tasks are tackled in series. So in my program I can rest assured that streamA would in order
and streamB would
Normally both streams might run asynchronous to each other. However, I would like to block streamB until A1 is done and then block streamA until B1 is done.
This appears not to be as simple. On my Ubuntu with Tesla M2090 (CC 2.0) the output of
nvcc -arch=sm_20 event.cu && ./a.out
is
Here is the plan:
Stream A: A1, launch 'HalfA', wait for 'HalfB', A2.
Stream B: Wait for 'HalfA', B1, launch 'HalfB', B2.
I would expect: A1,B1, (A2 and B2 running concurrently).
All has been started. Synchronize!
Hi! I am Kernel A1.
Hi! I am Kernel A2.
Hi! I am Kernel B1.
Hi! I am Kernel B2.
And I really would have expected B1 to be completed before the cudaEventRecord(halfB,streamB). Nevertheless stream A obviously does not wait for the completion of B1 and so not for the recording of halfB.
What's more: If I altogether delete the cudaEventRecord commands I would expect the program to lock down on the cudaStreamWait commands. But it does not and produces the same output. What am I overlooking here?
I think this is because "cudaStreamWaitEvent(streamA,halfB,0); " was called before "halfB" was recorded (cudaEventRecord(halfB,streamB);). It's likely that the cudaStreamWaitEvent call was searching for the closed "halfB" prior to it; since it was not found, it just quietly moved forward. See the following documentation:
The stream
stream
will wait only for the completion of the most recent host call tocudaEventRecord()
onevent
. Once this call has returned, any functions (includingcudaEventRecord()
andcudaEventDestroy()
) may be called onevent
again, and the subsequent calls will not have any effect onstream
.
I could not find a solution if you have to do a depth-first coding; however, the following code may lead to what you want:
k_A1<<<1,1,0,streamA>>>(d); // A1!
cudaEventRecord(halfA,streamA); // StreamA triggers halfA!
cudaStreamWaitEvent(streamB,halfA,0); // StreamB waits, for halfA.
k_B1<<<1,1,0,streamB>>>(d); // B1!
cudaEventRecord(halfB,streamB); // StreamB triggers halfB!
cudaStreamWaitEvent(streamA,halfB,0); // StreamA waits for halfB.
k_A2<<<1,1,0,streamA>>>(d); // A2!
k_B2<<<1,1,0,streamB>>>(d); // B2!
which is confirmed by the profiling:
Note that I changed the kernel interfaces.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With