Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding CUDA profiler output (nvprof)

I'm just looking at the following output and trying to wrap my mind around the numbers:

==2906== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 23.04%  10.9573s     16436  666.67us  64.996us  1.5927ms  sgemm_sm35_ldg_tn_32x16x64x8x16
 22.28%  10.5968s     14088  752.18us  612.13us  1.6235ms  sgemm_sm_heavy_nt_ldg
 18.09%  8.60573s     14088  610.86us  513.05us  1.2504ms  sgemm_sm35_ldg_nn_128x8x128x16x16
 16.48%  7.84050s     68092  115.15us  1.8240us  503.00us  void axpy_kernel_val<float, int=0>(cublasAxpyParamsVal<float>)
...
  0.25%  117.53ms      4744  24.773us     896ns  11.803ms  [CUDA memcpy HtoD]
  0.23%  107.32ms     37582  2.8550us  1.8880us  8.9556ms  [CUDA memcpy DtoH]

...

==2906== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 83.47%  41.8256s     42326  988.18us  16.923us  13.332ms  cudaMemcpy
  9.27%  4.64747s    326372  14.239us  10.846us  11.601ms  cudaLaunch
  1.49%  745.12ms   1502720     495ns     379ns  1.7092ms  cudaSetupArgument
  1.37%  688.09ms      4702  146.34us     879ns  615.09ms  cudaFree
...

When it comes to optimizing memory access, what are the numbers I really need to look at when comparing different implementations? It first looks like memcpy only takes 117.53+107.32ms (in both directions), but then there is this API call cudaMemcpy: 41.8256s, which is much more. Also, the min/avg/max columns don't add up between the upper and the lower output block.

Why is there a difference and what is the "true" number that is important for me to optimize the memory transfer?

EDIT: second question is: is there a way to figure out who is calling e.g. axpy_kernel_val (and how many times)?

like image 325
Pavel Avatar asked May 21 '15 10:05

Pavel


People also ask

What is Nvprof?

Nvprof is a command-line light-weight GUI-less profiler available for Linux, Windows, and Mac OS. This tool allows you to collect and view profiling data of CUDA-related activities on both CPU and GPU, including kernel execution, memory transfers, etc.

What is profiling explain CUDA profiling in details?

Profiling Overview The Visual Profiler is a graphical profiling tool that displays a timeline of your application's CPU and GPU activity, and that includes an automated analysis engine to identify optimization opportunities. The nvprof profiling tool enables you to collect and view profiling data from the command-line.

How do I find my CUDA cores?

If you have the nvidia-settings utilities installed, you can query the number of CUDA cores of your gpus by running nvidia-settings -q CUDACores -t . If that's not working, try nvidia-settings -q :0/CUDACores .

What is nsight compute?

NVIDIA® Nsight™ Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging via a user interface and command line tool. In addition, its baseline feature allows users to compare results within the tool.


1 Answers

The difference in total time is due to the fact that work is launched to the GPU asynchronously. If you have a long running kernel or set of kernels with no explicit synchronisation to the host, and follow them with a call to cudaMemcpy, the cudaMemcpy call will be launched well before the kernel(s) have finished executing. The total time of the API call is from the moment it is launched to the moment it completes, so will overlap with executing kernels. You can see this very clearly if you run the output through the NVIDIA Visual Profiler (nvprof -o xxx ./myApp, then import xxx into nvvp).

The difference is min time is due to launch overhead. While the API profiling takes all of the launch overhead into account, the kernel timing only contains a small part of it. Launch overhead can be ~10-20us, as you can see here.

In general, the API calls section lets you know what the CPU is doing, while, the profiling results tells you what the GPU is doing. In this case, I'd argue you're underusing the CPU, as arguably the cudaMemcpy is launched too early and CPU cycles are wasted. In practice, however, it's often hard or impossible to get anything useful out of these spare cycles.

like image 98
Jez Avatar answered Oct 10 '22 09:10

Jez