How to evaluate CUDA performance?

2 Answers

Is my program fully optimized: does it use all GPU cores, use shared memory properly, and have an adequate register count and enough occupancy?

To find this out, use one of the CUDA profilers. See "How Do You Profile & Optimize CUDA Kernels?"

How can I calculate CUDA's theoretical maximum throughput?

That math is slightly involved, differs for each architecture, and is easy to get wrong. It is better to look the numbers up in the specs for your chip. There are tables on Wikipedia, such as this one, for the GTX 500 cards. For instance, you can see from the table that a GTX 580 has a theoretical peak memory bandwidth of 192.4 GB/s and a compute throughput of 1581.1 GFLOPs.
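For readers who want to see where the two quoted figures come from, here is a minimal C++ sketch of the usual back-of-the-envelope arithmetic. The struct, the function names, and the GTX 580 hardware figures (512 cores, 1544 MHz shader clock, 384-bit bus, 4008 MT/s memory) are assumptions taken from public spec tables, not from this answer, and the formula only holds for architectures where each core can retire one fused multiply-add per cycle.

```cpp
#include <cassert>

// Hardware figures for one GPU. Field names are illustrative, not from any CUDA API.
struct GpuSpec {
    int    cores;             // number of CUDA cores
    double shader_clock_mhz;  // clock the cores run at, in MHz
    int    flops_per_cycle;   // 2 when a fused multiply-add counts as two FLOPs
    int    bus_width_bits;    // memory interface width in bits
    double mem_rate_mtps;     // effective memory transfer rate, in MT/s
};

// Peak single-precision throughput in GFLOPs: cores x clock x FLOPs per cycle.
double peak_gflops(const GpuSpec& g) {
    assert(g.cores > 0 && g.shader_clock_mhz > 0.0);
    return g.cores * g.shader_clock_mhz * g.flops_per_cycle / 1000.0;
}

// Peak memory bandwidth in GB/s: bytes per transfer x transfers per second.
double peak_bandwidth_gbs(const GpuSpec& g) {
    assert(g.bus_width_bits > 0 && g.bus_width_bits % 8 == 0);
    return (g.bus_width_bits / 8) * g.mem_rate_mtps / 1000.0;
}

// Assumed GTX 580 figures from public spec tables (not stated in the answer).
GpuSpec gtx580() {
    return {512, 1544.0, 2, 384, 4008.0};
}
```

With these assumed inputs, `peak_gflops(gtx580())` and `peak_bandwidth_gbs(gtx580())` land on the table values quoted above after rounding. For any other chip the inputs and the FLOPs-per-cycle factor need to be checked against its own documentation, which is why looking the numbers up is usually safer.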

Am I right that comparing a CPU's GFLOPS with a GPU's GFLOPS is a transparent comparison of their theoretical performance?

If I understand correctly, you are asking if the number of theoretical peak GFLOPs on a GPU can be directly compared with the corresponding number on a CPU. There are some things to consider when comparing these numbers:

Older GPUs did not support double precision (DP) floating point, only single precision (SP).
GPUs that do support DP do so with a significant performance degradation as compared to SP. The GFLOPs number I quoted above was for SP. On the other hand, numbers quoted for CPUs are often for DP, and there is less difference between the performance of SP and DP on a CPU.
CPU quotes can be for rates that are achievable only when using SIMD (single instruction, multiple data) vectorized instructions, and it is typically very hard to write algorithms that can approach the theoretical maximum (and they may have to be written in assembly). Sometimes, CPU quotes are for a combination of all computing resources available through different types of instructions, and it is often virtually impossible to write a program that can exploit them all simultaneously.
The rates quoted for GPUs assume that you have enough parallel work to saturate the GPU and that your algorithm is not bandwidth bound.
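The last point, whether an algorithm is bandwidth bound, can be estimated before writing any GPU code. The C++ sketch below uses the roofline model, which is a technique brought in here for illustration rather than something this answer prescribes. The function names are hypothetical, and the FLOP and byte counts for a kernel have to be counted by hand.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on achievable GFLOPs for a kernel under the roofline model.
// flops_per_byte is the kernel's arithmetic intensity: floating point
// operations performed per byte moved to or from device memory.
double attainable_gflops(double peak_gflops, double peak_gbs, double flops_per_byte) {
    assert(flops_per_byte > 0.0);
    return std::min(peak_gflops, peak_gbs * flops_per_byte);
}

// Arithmetic intensity below which a kernel is limited by memory bandwidth
// rather than by compute throughput.
double ridge_flops_per_byte(double peak_gflops, double peak_gbs) {
    assert(peak_gbs > 0.0);
    return peak_gflops / peak_gbs;
}
```

As a usage example, single-precision `y = a*x + y` performs 2 FLOPs per element and moves 12 bytes (two 4-byte reads and one 4-byte write). Calling `attainable_gflops(1581.1, 192.4, 2.0 / 12.0)` with the GTX 580 figures quoted earlier gives roughly 32 GFLOPs, about 2% of the quoted peak, so for such a kernel the peak GFLOPs number says very little about real performance.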

116

answered Nov 06 '22 21:11

Roger Dahl

The preferred measure of performance is elapsed time. GFLOPs can be used as a comparison method but it is often difficult to compare between compilers and architectures due to differences in instruction set, compiler code generation, and method of counting FLOPs.

The best method is to time the performance of the application. For the CUDA code, you should time all code that runs per launch. This includes memory copies and synchronization.
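A minimal host-side sketch of that kind of measurement, in standard C++ so it compiles without a GPU, might look like the following. The helper names are hypothetical. In a real CUDA program the callable passed as `work` would contain the whole per-launch sequence: the `cudaMemcpy` calls in both directions, the kernel launch, and a `cudaDeviceSynchronize()`. Kernel launches are asynchronous, so without the synchronization the clock would stop before the kernel finishes.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` once untimed, then `repeats` more times, and returns the mean
// wall-clock time per timed run in milliseconds.
// `work` is assumed to block until all of its GPU work has completed.
template <typename F>
double mean_elapsed_ms(F&& work, int repeats) {
    assert(repeats > 0);
    work();  // warm-up run, excluded so one-time start-up cost is not measured
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / repeats;
}

// Achieved GFLOPs from a hand-counted FLOP total and a measured time.
double achieved_gflops(double flop_count, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return flop_count / (elapsed_ms * 1.0e6);
}
```

The warm-up run is a design choice, not something this answer requires: the first CUDA call in a process typically pays for context creation, which would otherwise dominate short measurements. If that start-up cost matters for the application, time the first run as well. The achieved GFLOPs figure inherits the FLOP-counting caveats described above.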

Nsight Visual Studio Edition and the Visual Profiler provide the most accurate measurement of each operation. Nsight Visual Studio Edition provides theoretical bandwidth and FLOPs values for each device. In addition, the Achieved FLOPs experiment can be used to capture the FLOP count for both single and double precision.

answered Nov 06 '22 23:11

Greg Smith


Tags:

performance

cuda

bongmo.kim
