Timing different sections in CUDA kernel

I have a CUDA kernel that calls out to a series of device functions.

What is the best way to get the execution time for each of the device functions?

What is the best way to get the execution time for a section of code in one of the device functions?

asked Jun 26 '12 by Roger Dahl


1 Answer

In my own code, I use the clock() function to get precise timings. For convenience, I have the following macros:

/* One timer slot per code section to be timed. */
enum {
    tid_this = 0,
    tid_that,
    tid_count
    };

/* Per-slot tick counters, accumulated over all blocks. */
__device__ float cuda_timers[ tid_count ];

#ifdef USETIMERS
 /* Record the starting tick in thread 0 of each block. */
 #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
 /* Accumulate the elapsed ticks (wrap-safe) into the given timer slot. */
 #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
 /* Timers disabled: the macros expand to nothing. */
 #define TIMER_TIC
 #define TIMER_TOC(tid)
#endif

These can then be used to instrument the device code as follows:

__global__ void mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and store the result in the "tid_this" counter. */
    TIMER_TOC( tid_this );

    }

You can then read the cuda_timers in the host code.
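
For completeness, here is a minimal sketch of that readback (hypothetical host-side names; it assumes the cuda_timers array and tid_* enum from above, and that the kernel has already been launched):

float host_timers[ tid_count ];

/* Make sure the kernel has finished before reading the counters. */
cudaDeviceSynchronize();

/* Copy the accumulated tick counts out of the __device__ array. */
cudaMemcpyFromSymbol( host_timers , cuda_timers , sizeof(float) * tid_count );

printf( "tid_this: %.0f ticks (summed over all blocks)\n" , host_timers[ tid_this ] );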

A few notes:

  • The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
  • Having said that, the timers assume that thread zero of each block is active, so make sure you do not call these macros in a potentially divergent part of the code.
  • The timers count clock ticks. To convert to milliseconds, divide the tick count by your device's clock rate in kHz (equivalently, dividing by the clock rate in GHz gives nanoseconds); see the conversion sketch after this list.
  • The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
  • Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
  • The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.
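
As a rough sketch of the tick-to-millisecond conversion mentioned above (standard CUDA runtime calls; host_timers is the hypothetical host-side array from the readback sketch earlier):

cudaDeviceProp prop;
cudaGetDeviceProperties( &prop , 0 );   /* device 0; adjust as needed */

/* prop.clockRate is reported in kHz, so ticks / clockRate gives milliseconds. */
float ms_this = host_timers[ tid_this ] / (float)prop.clockRate;
printf( "tid_this: %.3f ms (summed over all blocks)\n" , ms_this );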

P.S. This is a copy of my reply to this question, which didn't get many points there since the timing required was for the whole kernel.

answered Oct 23 '22 by Pedro