Nvidia's nvprof outputs for FLOPS

Question

I see that nvprof can report the number of floating-point operations in a kernel (using the metrics shown below). The documentation (here http://docs.nvidia.com/cuda...) says flop_count_sp is "Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special). Each multiply-accumulate operation contributes 2 to the count."

However, when I run it, the reported flop_count_sp (which is supposed to be flop_count_sp_add + flop_count_sp_mul + flop_count_sp_special + 2 * flop_count_sp_fma) does not include the value of flop_count_sp_special in the sum.

Which value should I use? Should I add flop_count_sp_special to flop_count_sp myself, or should I assume the formula does not include it?

Also, could you please tell me what these special operations are?

I'm using the following command line:

nvprof --metrics flop_count_sp --metrics flop_count_sp_add --metrics flop_count_sp_mul --metrics flop_count_sp_fma --metrics flop_count_sp_special myKernel args

Where myKernel is the name of my CUDA kernel which has some input arguments given by args.

A section of my nvprof outputs for instance is as shown below:

 ==20549== Profiling result:
 ==20549== Metric result:
 Invocations                               Metric Name                        Metric Description         Min         Max         Avg
 Device "Tesla K40c (0)"
    Kernel: mykernel(float*, int, int, float*, int, float*, int*)
           2                             flop_count_sp  Floating Point Operations(Single Precisi       70888       70888       70888
           2                         flop_count_sp_add  Floating Point Operations(Single Precisi       14465       14465       14465
           2                         flop_count_sp_mul  Floating Point Operation(Single Precisio       14465       14465       14465
           2                         flop_count_sp_fma  Floating Point Operations(Single Precisi       20979       20979       20979
           2                     flop_count_sp_special  Floating Point Operations(Single Precisi       87637       87637       87637

Tom · Accepted Answer

The "special" operations are listed in the arithmetic throughput table in the Programming Guide: reciprocal, reciprocal square root, log, exp, sin, and cos. Note that these are less precise (but faster) than the default versions; you have to opt in using the intrinsics or a compiler flag (-use_fast_math).

Despite what the documentation says, it seems the special operations are not included in the flop_count_sp total. That's a bug in the current version (8.0). I've filed a bug report, so it should be fixed in a future release (at which point this paragraph will be out of date).
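
As a quick sanity check, the counters from the question can be plugged into both versions of the formula. This is only a sketch in Python: the numbers are copied from the nvprof output above, the variable names are mine, and counting each special operation as one flop is an assumption based on the quoted documentation (which singles out only multiply-accumulate as counting twice).

```python
# Counter values copied from the nvprof output in the question (Tesla K40c).
counters = {
    "flop_count_sp": 70888,
    "flop_count_sp_add": 14465,
    "flop_count_sp_mul": 14465,
    "flop_count_sp_fma": 20979,
    "flop_count_sp_special": 87637,
}

# What nvprof appears to report: add + mul + 2 * fma, with no special term.
without_special = (
    counters["flop_count_sp_add"]
    + counters["flop_count_sp_mul"]
    + 2 * counters["flop_count_sp_fma"]
)

# What the documented definition describes: the same sum plus the special ops.
# Assumption: each special operation contributes 1 to the count.
with_special = without_special + counters["flop_count_sp_special"]

print(without_special == counters["flop_count_sp"])  # True
print(with_special)                                  # 158525
```

The reported flop_count_sp matches the sum without the special term exactly, so if you want a total that follows the documented definition, adding flop_count_sp_special yourself looks like the way to get it with this version.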

Tags: cuda, nvprof

Amit
