I am trying to switch an algorithm that I had written from a Tesla T10 processor (compute capability 1.3) to a Tesla M2075 (compute capability 2.0). While switching I was surprised to find that my algorithm slowed down. I analyzed it and it seems to be because on the new machine the CUDA streams are blocking. My algorithm has 3 main tasks that can be split up and run in parallel: memory reorganization (which can be done on the CPU), memory copying from the host to the device, and the kernel execution on the device. On the old machine splitting the work across streams allowed the 3 tasks to overlap like this (all screenshots from the NVIDIA Visual Profiler):
However, on the new machine the streams block before starting the CPU computation, waiting until the previous kernel has finished executing, as can be seen here:
In the top row you can see that all the orange blocks are the cudaStreamSynchronize calls, which block until the previous kernel has finished executing, even though that kernel is on a completely different stream. It seems to work for the first run through the streams and parallelizes correctly, but after that the problem starts, so I thought that maybe it was blocking on something and tried increasing the number of streams, which gave me this result:
Here you can see that for some reason only the first 4 streams are blocking; after that it starts parallelizing properly. As a last attempt I tried to hack around it by using the first 4 streams only once and then switching to the later streams, but that still didn't work: it still stalled every 4 streams while letting the other streams execute concurrently:
So I am looking for any ideas as to what could be causing this problem and how to diagnose it. I have pored over my code and I don't think the bug is there, although I could be mistaken. Each stream is encapsulated in its own class, and each class only holds a reference to a single cudaStream_t as a member, so I don't see how one stream could be referencing another stream and blocking on it.
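To give a general picture, here is a simplified sketch of how each stream is driven; the class, kernel, and buffer names are illustrative placeholders, not my actual code:

    #include <cuda_runtime.h>

    __global__ void processKernel(const float* in, float* out, int n) { /* placeholder */ }

    class StreamWorker {
    public:
        explicit StreamWorker(int n) : n_(n) {
            cudaStreamCreate(&stream_);
            cudaMallocHost((void**)&hostBuf_, n * sizeof(float));  // pinned memory, needed for async copies
            cudaMalloc((void**)&devIn_, n * sizeof(float));
            cudaMalloc((void**)&devOut_, n * sizeof(float));
        }
        ~StreamWorker() {
            cudaFree(devOut_);
            cudaFree(devIn_);
            cudaFreeHost(hostBuf_);
            cudaStreamDestroy(stream_);
        }
        void run(const float* src) {
            reorganize(src);                                        // task 1: CPU memory reorganization
            cudaMemcpyAsync(devIn_, hostBuf_, n_ * sizeof(float),   // task 2: HtoD copy on this stream
                            cudaMemcpyHostToDevice, stream_);
            processKernel<<<(n_ + 255) / 256, 256, 0, stream_>>>(   // task 3: kernel on this stream
                devIn_, devOut_, n_);
        }
        void wait() { cudaStreamSynchronize(stream_); }             // the call that shows up as blocking
    private:
        void reorganize(const float* src) { /* rearrange src into hostBuf_ */ }
        cudaStream_t stream_;
        float* hostBuf_;
        float* devIn_;
        float* devOut_;
        int n_;
    };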
Are there changes to the way streams work between compute capability 1.3 and 2.0 that I'm not aware of? Could it be something like shared memory not being freed and the stream having to wait on that? Any ideas for how to diagnose this problem are welcome, thanks.
PROBLEM 1: USING THE DEFAULT STREAM

Symptoms:
- One stream will not overlap other streams (in CUDA 5.0, stream 2 = the default stream)
- Search for cudaEventRecord(event), cudaMemcpyAsync(), etc.: if no stream is specified, the call is placed into the default stream
- Search for kernel launches in the default stream: <<<a,b>>>

Solutions:
- Specify a non-default stream explicitly in every call, e.g. cudaEventRecord(event, stream), cudaMemcpyAsync(..., stream), kernel<<<a, b, 0, stream>>>()
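As a minimal sketch of the difference (placeholder kernel and buffer names, not taken from any specific codebase): a call that omits the stream argument lands in the default stream and serializes against every other stream, while a call with an explicit non-default stream can overlap.

    #include <cuda_runtime.h>

    __global__ void work(float* data, int n) { /* placeholder */ }

    // Default stream: these calls serialize against work in every other stream.
    void launchBlocking(float* d_data, float* h_data, size_t bytes, int n) {
        cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice);
        work<<<(n + 255) / 256, 256>>>(d_data, n);
    }

    // Explicit non-default stream: can overlap with work in other streams.
    void launchOverlapping(float* d_data, float* h_data, size_t bytes, int n, cudaStream_t stream) {
        cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
        work<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    }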
CUDA STREAMS

A stream is a queue of device work:
- The host places work in the queue and continues on immediately
- The device schedules work from streams when resources are free

CUDA operations (e.g. kernel launches, memory copies) are placed within a stream.
SYNCHRONIZATION

- Synchronize the host with respect to a specific stream: cudaStreamSynchronize(stream) blocks the host until all CUDA calls issued into that stream are complete
- Synchronize the host or devices using events (finer-grained, i.e. less, synchronization than blocking on a whole stream)
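A minimal sketch of that behaviour (placeholder kernel and names; h_data is assumed to be pinned host memory): the host returns immediately from each enqueue and only blocks at the explicit synchronization point.

    #include <cuda_runtime.h>

    __global__ void work(float* data, int n) { /* placeholder */ }

    void pipeline(float* h_data, float* d_data, size_t bytes, int n) {
        cudaStream_t s;
        cudaStreamCreate(&s);

        cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, s);  // returns immediately
        work<<<(n + 255) / 256, 256, 0, s>>>(d_data, n);                    // returns immediately

        // ... the host is free to do other CPU work here ...

        cudaStreamSynchronize(s);  // blocks the host until everything issued into s has completed
        cudaStreamDestroy(s);
    }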
CUDA EVENTS

Events provide a mechanism to signal when operations have occurred in a stream.

cudaStreamWaitEvent(stream, event):
- Blocks stream until event occurs
- Only blocks work launched into stream after this call
- Does not block the host!

A common multi-threading mistake is calling cudaEventSynchronize before cudaEventRecord.
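For example, a cross-stream dependency can be expressed with an event so that one stream waits on another without ever blocking the host (a sketch with illustrative names):

    #include <cuda_runtime.h>

    __global__ void work(float* data, int n) { /* placeholder */ }

    void crossStreamDependency(float* h_data, float* d_data, size_t bytes, int n,
                               cudaStream_t copyStream, cudaStream_t computeStream) {
        cudaEvent_t copyDone;
        cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming);

        cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, copyStream);
        cudaEventRecord(copyDone, copyStream);           // record before anything waits on the event

        cudaStreamWaitEvent(computeStream, copyDone, 0); // computeStream waits for the copy; the host does not
        work<<<(n + 255) / 256, 256, 0, computeStream>>>(d_data, n);

        cudaEventDestroy(copyDone);
    }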
I cannot be completely sure without seeing code, but it looks like you may be having an issue with the order in which you enqueue your commands. There is a slight difference in the way compute capability 1.x and 2.x devices handle streams, because 2.x devices can run multiple kernels concurrently and can handle an HtoD and a DtoH copy simultaneously.
If you enqueue your commands in the order all HtoDs, all computes, all DtoHs, you will have good results on the compute capability 1.x Tesla cards (C1060 and the like).
If you instead order them per stream, i.e. copy HtoD, compute, copy DtoH, then the next stream's copy HtoD, and so on, you will have good results on Fermi.
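Here is a sketch of the two issue orders, with a placeholder kernel and buffer names (it assumes pinned host buffers and per-stream device buffers have already been allocated; nothing here is from your code):

    #include <cuda_runtime.h>

    __global__ void work(float* in, float* out, int n) { /* placeholder */ }

    // Order 1: batch by operation type; overlaps well on compute capability 1.x (C1060 and similar).
    void issueBatched(int nStreams, cudaStream_t* stream, float** h_in, float** d_in,
                      float** d_out, float** h_out, size_t bytes, int n) {
        for (int i = 0; i < nStreams; ++i)
            cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, stream[i]);
        for (int i = 0; i < nStreams; ++i)
            work<<<(n + 255) / 256, 256, 0, stream[i]>>>(d_in[i], d_out[i], n);
        for (int i = 0; i < nStreams; ++i)
            cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, stream[i]);
    }

    // Order 2: issue each stream's whole chain before moving to the next; overlaps well on Fermi.
    void issuePerStream(int nStreams, cudaStream_t* stream, float** h_in, float** d_in,
                        float** d_out, float** h_out, size_t bytes, int n) {
        for (int i = 0; i < nStreams; ++i) {
            cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, stream[i]);
            work<<<(n + 255) / 256, 256, 0, stream[i]>>>(d_in[i], d_out[i], n);
            cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, stream[i]);
        }
    }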
Kepler does equally well in both cases. The issue order does matter across streams on both the Tesla and Fermi architectures; I suggest reading this NVIDIA post for more information. Overlapping across streams can be an extremely complicated problem, and I wish you well. If you want further help, a general representation of the order in which you enqueue operations would be extremely helpful.