Why does CUDA code run so much faster in NVIDIA Visual Profiler?

A piece of code that takes well over a minute when run from the command line finishes in a matter of seconds in NVIDIA Visual Profiler (running the same .exe). Why? Is something wrong with the command-line run, or does Visual Profiler do something different and not actually execute everything that runs on the command line?

I'm using CUBLAS, Thrust and cuRAND.

Incidentally, compiled code on my machine has slowed down noticeably very recently, including old code that previously ran quickly, which is why I'm getting suspicious.

Update:

I have checked that the calculated output is identical on the command line and in Visual Profiler, i.e. all the required code ran in both cases.
GPU-shark indicated that my performance state stayed at P0 when I switched from the command line to Visual Profiler.
However, GPU usage was reported as 0.0% when run under Visual Profiler, but went as high as 98% when run from the command line.
Moreover, far less memory is used under Visual Profiler. When run from the command line, Task Manager shows 650-700MB of memory in use (spiking at the first cudaFree(0) call). Under Visual Profiler that figure drops to ~100MB.

asked May 15 '13 01:05

mchen

1 Answer

This is an old question, but I've just finished chasing the same issue (though the cause may not be the same).

Namely: my app achieved between 900 and 1100 frames (synchronous launches) per second when running under NVVP, but around 100-120 when running outside of the profiler.

The cause appears to be a status message I was printing to the console via cout. I had intended for this to only happen about once every 100-200 frames. Instead, it was printing the status message for every frame, and the console IO became the bottleneck.

By only printing the status message every 100 frames (though the optimal number here would depend on your application), the frame rate jumped back up to match what I was seeing in NVVP. Of course, this could also be handled in a separate CPU thread if that sort of overhead is unacceptable in your circumstances.

NVVP has to redirect stdout to its own internal buffer in order to capture the application's output (which it shows in its console tab). That mechanism appears to have significantly less overhead than letting the operating system write to the console directly. My guess is that NVVP either buffers everything and displays it from a separate thread, or saves up output until some threshold is reached and then appends that batch to its console tab.

So, my advice would be to disable any console IO, and see if or how that affects things.

(It didn't help that VS2012 refused to profile my CUDA app. It would have been nice to see that 80% of the execution time was spent on console IO.)

Hope this helps!

answered Sep 19 '22 23:09

3Dave

Tags: performance, cuda, gpgpu