How to observe CUDA events and metrics for a subsection of an executable (e.g. only during a kernel execution time)?

Tags: profiling, cuda, nvprof, nvvp

I'm familiar with using nvprof to access the events and metrics of a benchmark, e.g.,

nvprof --system-profiling on --print-gpu-trace -o (file name) --events inst_issued1 ./benchmarkname

The options

--system-profiling on --print-gpu-trace -o (filename)

give timestamps for kernel start and end times, along with power and temperature, and save the information to an nvvp file so we can view it in the Visual Profiler. This lets us see what is happening in any section of the code, in particular while a specific kernel is running. My question is this:

Is there a way to isolate the events counted for only a section of the benchmark run, for example during a kernel execution? In the command above,

--events inst_issued1

just gives the instructions tallied for the whole executable. Thanks!


asked Sep 17 '15 17:09

travelingbones

1 Answer

You may want to read the profiler documentation.

You can turn profiling on and off within an executable. The CUDA runtime API functions for this are:

cudaProfilerStart() 
cudaProfilerStop()

So, if you wanted to collect profile information only for a specific kernel, you could do:

#include <cuda_profiler_api.h>
...

cudaProfilerStart();
myKernel<<<...>>>(...);
cudaProfilerStop();

Excerpting from the documentation:

When using the start and stop functions, you also need to instruct the profiling tool to disable profiling at the start of the application. For nvprof you do this with the --profile-from-start off flag. For the Visual Profiler you use the Start execution with profiling enabled checkbox in the Settings View.
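Putting the pieces together, a minimal self-contained sketch might look like the following. The kernel bodies, the file name profile_region.cu and the launch configuration are hypothetical placeholders, not anything from the question; the cudaDeviceSynchronize() before cudaProfilerStop() is a precaution I have added so the kernel has finished before profiling is switched off. Whether inst_issued1 is available depends on your GPU.

```cuda
// Hypothetical build and run commands (file name is an assumption):
//   nvcc -o profile_region profile_region.cu
//   nvprof --profile-from-start off --events inst_issued1 ./profile_region

#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>  // declares cudaProfilerStart / cudaProfilerStop

// Hypothetical kernel that runs outside the profiled region.
__global__ void setupKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = static_cast<float>(i);
}

// Hypothetical kernel standing in for the one you want to profile.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Not profiled: with --profile-from-start off, profiling is disabled here.
    setupKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    // Profiled region: only work launched between these calls is collected.
    cudaProfilerStart();
    myKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();  // precaution: let the kernel finish inside the region
    cudaProfilerStop();

    cudaFree(d_data);
    return 0;
}
```

Run this way, the event output should cover myKernel but not setupKernel. Without --profile-from-start off, nvprof profiles from the beginning of the run, so setupKernel would show up as well.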

Also from the documentation for nvprof specifically, you can limit event/metric tabulation to a single kernel with a command line switch:

 --kernels <kernel name>

The documentation gives additional usage possibilities.

answered Sep 22 '22 04:09

Robert Crovella
