A piece of code that takes well over 1 minute on the command line finished in a matter of seconds in NVIDIA Visual Profiler (running the same .exe). So the natural question is: why? Is there something wrong with the command line, or does Visual Profiler do something different and not really execute everything as it does on the command line?
I'm using cuBLAS, Thrust and cuRAND.
Incidentally, there's been a noticeable slowdown in compiled code on my machine very recently, even for old code that previously ran quickly, so I'm getting suspicious.
Update: [...] cudaFree(0) call). In Visual Profiler that figure goes down to ~100MB.
It is interesting to note that for small matrices the CPU performs the task faster, whereas for larger arrays CUDA outperforms the CPU by a large margin. At large scale the CUDA times appear not to increase, but plotting the CUDA times alone shows that they also increase linearly.
Profiling Overview: The Visual Profiler is a graphical profiling tool that displays a timeline of your application's CPU and GPU activity, and that includes an automated analysis engine to identify optimization opportunities. The nvprof profiling tool enables you to collect and view profiling data from the command-line.
We announced that NVIDIA Visual Profiler and nvprof would be deprecated in a future version after CUDA Toolkit 10.1. More specifically: NVIDIA Visual Profiler and nvprof will end support at the Volta GPU architecture.
NVIDIA® Nsight™ Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging via a user interface and command line tool. In addition, its baseline feature allows users to compare results within the tool.
This is an old question, but I've just finished chasing the same issue (though the cause may not be the same).
Namely: my app achieved between 900 and 1100 frames (synchronous launches) per second when running under NVVP, but around 100-120 when running outside of the profiler.
The cause appears to be a status message I was printing to the console via cout. I had intended for this to only happen about once every 100-200 frames. Instead, it was printing the status message for every frame, and the console IO became the bottleneck.
By only printing the status message every 100 frames (though the optimal number here would depend on your application), the frame rate jumped back up to match what I was seeing in NVVP. Of course, this could also be handled in a separate CPU thread if that sort of overhead is unacceptable in your circumstances.
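For illustration, here is a minimal sketch of that throttling in C++. The names report_status and run_frames are hypothetical, and the loop body is only a placeholder for the real kernel launches. This is not the code from my app, just the general pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>  // std::ostringstream can stand in for the console when testing

// Hypothetical helper: writes a status line only on every `interval`-th frame.
// Returns true when a line was actually written.
bool report_status(std::ostream& out, std::size_t frame, std::size_t interval) {
    assert(interval > 0 && "interval must be positive");
    if (frame % interval != 0) {
        return false;  // skip console IO on most frames
    }
    out << "frame " << frame << '\n';  // '\n' avoids the flush that std::endl forces
    return true;
}

// Runs `frames` iterations of a stand-in frame loop and returns how many
// status lines were written.
std::size_t run_frames(std::ostream& out, std::size_t frames, std::size_t interval) {
    std::size_t printed = 0;
    for (std::size_t frame = 0; frame < frames; ++frame) {
        // ... launch kernels and synchronize here (placeholder) ...
        if (report_status(out, frame, interval)) {
            ++printed;
        }
    }
    return printed;
}
```

Calling run_frames(std::cout, 1000, 100) would print on frames 0, 100, ..., 900 only. The right interval depends on how long a frame takes in your application.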
NVVP has to redirect stdout to its own internal buffer in order to capture the application's output (which it shows in its console tab). It appears that NVVP's mechanism for buffering or processing that output has significantly less overhead than allowing the operating system to handle it directly. It looks like NVVP is buffering everything, and displaying it in a separate thread, or just saving a bunch of output until some threshold is reached, when it adds that buffer to its own console tab.
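To be clear, I don't know how NVVP actually implements this. The sketch below only shows the general "save output until a threshold is reached" idea in plain C++, in case you want similar behaviour outside the profiler. BufferedLog is a hypothetical class written for this example, not part of any NVIDIA tool.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>  // std::ostringstream can stand in for the console when testing
#include <string>

// Hypothetical wrapper: collects messages in memory and forwards them to the
// underlying stream only once `threshold` bytes have accumulated.
class BufferedLog {
public:
    BufferedLog(std::ostream& sink, std::size_t threshold)
        : sink_(sink), threshold_(threshold) {
        assert(threshold > 0 && "threshold must be positive");
    }

    ~BufferedLog() { flush(); }  // do not lose pending text on destruction

    void write(const std::string& message) {
        pending_ += message;
        if (pending_.size() >= threshold_) {
            flush();
        }
    }

    void flush() {
        if (pending_.empty()) {
            return;
        }
        sink_ << pending_;  // one large write instead of many small ones
        sink_.flush();
        pending_.clear();
        ++flush_count_;
    }

    std::size_t flush_count() const { return flush_count_; }

private:
    std::ostream& sink_;
    std::size_t threshold_;
    std::string pending_;
    std::size_t flush_count_ = 0;
};
```

Constructed as BufferedLog log(std::cout, 4096), each log.write(...) stays in memory until roughly 4 KB is pending. The trade-off is that status text reaches the console late, and anything still pending is lost if the process crashes.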
So, my advice would be to disable any console IO, and see if or how that affects things.
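One rough way to check this is to time the same loop with and without the status output. The helper below is a hypothetical sketch: time_loop_ms is a name made up for this example and the loop body is a placeholder for your real per-frame work, so the numbers it gives only indicate the relative cost of the printing.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <ostream>
#include <sstream>  // std::ostringstream can stand in for the console when testing

// Hypothetical benchmark helper: runs `frames` iterations of a stand-in
// workload and returns the elapsed wall-clock time in milliseconds.
// Pass nullptr as `status` to run with console output disabled.
double time_loop_ms(std::size_t frames, std::ostream* status) {
    assert(frames > 0 && "need at least one frame to time");
    volatile std::size_t work = 0;  // keeps the loop from being optimized away
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t frame = 0; frame < frames; ++frame) {
        work = work + frame;  // placeholder for kernel launch + synchronize
        if (status != nullptr) {
            *status << "frame " << frame << std::endl;  // flushes on every frame
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Comparing time_loop_ms(n, &std::cout) against time_loop_ms(n, nullptr) from a real console window should show whether the printing dominates. Run it from the same command line where you see the slowdown, since redirected or captured output behaves differently.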
(It didn't help that VS2012 refused to profile my CUDA app. It would have been nice to see that 80% of the execution time was spent on console IO.)
Hope this helps!