I have a CUDA kernel that calls out to a series of device functions. What is the best way to get the execution time for each of the device functions? What is the best way to get the execution time for a section of code in one of the device functions?


Timing different sections in CUDA kernel

1 Answer

In my own code, I use the clock() function to get precise timings. For convenience, I have the following macros:

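enum {
    tid_this = 0,
    tid_that,
    tid_count
    };

/* Accumulated clock ticks, one slot per timer ID. */
__device__ float cuda_timers[ tid_count ];

#ifdef USETIMERS
 /* Thread 0 of each block records the start tick. */
 #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
 /* Thread 0 adds the elapsed ticks to cuda_timers[tid]. */
 #define TIMER_TOC(tid) clock_t toc = clock(); \
     if ( threadIdx.x == 0 ) \
         atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
 /* With USETIMERS undefined, the macros expand to nothing. */
 #define TIMER_TIC
 #define TIMER_TOC(tid)
#endif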

These can then be used to instrument the device code as follows:

__global__ void mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and add the result to the "tid_this" counter. */
    TIMER_TOC( tid_this );

    }

You can then read cuda_timers from the host code, e.g. with cudaMemcpyFromSymbol.
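As a rough sketch of how the pieces fit together, the program below times two sections of a kernel and reads the counters back on the host. The device functions step_this and step_that, the launch configuration, and the data buffer are hypothetical stand-ins, not part of the original code, and error checking is omitted for brevity. Because the macros declare the variables tic and toc, each timed section is wrapped in its own pair of braces.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define USETIMERS

/* Timer IDs and macros as defined in the answer. */
enum { tid_this = 0, tid_that, tid_count };
__device__ float cuda_timers[ tid_count ];

#ifdef USETIMERS
 #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
 #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
 #define TIMER_TIC
 #define TIMER_TOC(tid)
#endif

/* Hypothetical device functions standing in for the real work. */
__device__ float step_this( float x ) {
    for ( int k = 0; k < 100; ++k ) x = x * 1.0001f + 1.0f;
    return x;
}

__device__ float step_that( float x ) {
    for ( int k = 0; k < 400; ++k ) x = x * 0.9999f + 0.5f;
    return x;
}

__global__ void mykernel( float *data ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    /* Braces give each tic/toc pair its own scope. */
    { TIMER_TIC data[i] = step_this( data[i] ); TIMER_TOC( tid_this ); }
    { TIMER_TIC data[i] = step_that( data[i] ); TIMER_TOC( tid_that ); }
}

int main() {
    const int blocks = 100, threads = 128;   /* assumed launch configuration */
    float *data = nullptr;
    cudaMalloc( &data, blocks * threads * sizeof(float) );
    cudaMemset( data, 0, blocks * threads * sizeof(float) );

    /* Zero the accumulators before the launch. */
    float ticks[ tid_count ] = { 0.0f };
    cudaMemcpyToSymbol( cuda_timers, ticks, sizeof(ticks) );

    mykernel<<< blocks, threads >>>( data );
    cudaDeviceSynchronize();

    /* Copy the accumulated tick counts back to the host. */
    cudaMemcpyFromSymbol( ticks, cuda_timers, sizeof(ticks) );

    /* Peak clock rate in kHz, i.e. ticks per millisecond at that rate. */
    int device = 0, khz = 0;
    cudaGetDevice( &device );
    cudaDeviceGetAttribute( &khz, cudaDevAttrClockRate, device );

    const char *names[ tid_count ] = { "this", "that" };
    for ( int t = 0; t < tid_count; ++t )
        printf( "%s: %.0f ticks (%.3f ms summed over %d blocks)\n",
                names[t], ticks[t], ticks[t] / khz, blocks );

    cudaFree( data );
    return 0;
}
```

The reported times are sums over all blocks, so they can exceed the wall-clock duration of the kernel. The conversion uses the peak clock rate reported by the runtime, so the millisecond figures are only approximate if the device runs at a different frequency.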

A few notes:

The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
Because only thread 0 of each block records the time, the timer assumes that this thread is active, so make sure you do not call these macros in a possibly divergent part of the code.
The timers count clock ticks. To get the number of milliseconds, divide the tick count by the clock frequency of your device in kHz (equivalently, divide by the frequency in Hz and multiply by 1000).
The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
Although clock() returns integer values of type clock_t, I store the accumulated values as float; otherwise the totals would overflow for kernels that take longer than a few seconds (accumulated over all blocks).
The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.
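The notes on tick conversion and wraparound can be sketched as two small helpers. This is only an illustration under assumptions: the names elapsed_ticks and ticks_to_ms are hypothetical, and the counter is assumed to be 32 bits wide, as the 0xffffffff in the selection above implies. With unsigned 32-bit arithmetic, toc - tic is computed modulo 2^32, so a single wraparound needs no branch. The result is toc + (2^32 - tic), which is one tick more than the wrapped branch of the selection above.

```cpp
#include <cstdint>

/* Lets the helpers compile under nvcc and under a plain C++ compiler. */
#ifdef __CUDACC__
 #define HD __host__ __device__
#else
 #define HD
#endif

/* Elapsed ticks between two readings of a 32-bit counter (assumed width).
   Unsigned subtraction is modulo 2^32, so one wraparound is handled
   without a branch. */
HD inline std::uint32_t elapsed_ticks( std::uint32_t tic, std::uint32_t toc ) {
    return static_cast<std::uint32_t>( toc - tic );
}

/* Convert a tick count to milliseconds, given the clock rate in kHz
   (i.e. ticks per millisecond). */
HD inline double ticks_to_ms( double ticks, int clock_khz ) {
    return ticks / clock_khz;
}
```

For instance, 3,000,000 ticks on a device clocked at 1,500,000 kHz (1.5 GHz) correspond to 2 ms, and a start reading of 0xffffffff followed by an end reading of 0 counts as 1 tick.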

P.S. This is a copy of my reply to this question, which didn't get many points there since the timing required was for the whole kernel.

175

answered Oct 23 '22 02:10

Pedro



Tags: optimization, benchmarking, cuda

Roger Dahl
