I see that nvprof can profile the number of floating-point operations in a kernel (using the parameters below). When I browse the documentation (here http://docs.nvidia.com/cuda...), it says flop_count_sp is "Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special). Each multiply-accumulate operation contributes 2 to the count."
However, when I run it, I find that flop_count_sp (which is supposed to be flop_count_sp_add + flop_count_sp_mul + flop_count_sp_special + 2 * flop_count_sp_fma) does not include the value of flop_count_sp_special in the sum.
Could you suggest what I am supposed to use? Should I add this value to flop_count_sp myself, or should I assume that the formula does not include flop_count_sp_special?
Also, could you please tell me what these special operations are?
I'm using the following command line:
nvprof --metrics flops_sp --metrics flops_sp_add --metrics flops_sp_mul --metrics flops_sp_fma --metrics flops_sp_special myKernel args
Here myKernel is the executable that launches my CUDA kernel, and args are its input arguments.
A section of my nvprof output, for instance, is shown below:
==20549== Profiling result:
==20549== Metric result:
Invocations  Metric Name            Metric Description                        Min    Max    Avg
Device "Tesla K40c (0)"
    Kernel: mykernel(float*, int, int, float*, int, float*, int*)
          2  flop_count_sp          Floating Point Operations(Single Precisi  70888  70888  70888
          2  flop_count_sp_add      Floating Point Operations(Single Precisi  14465  14465  14465
          2  flop_count_sp_mul      Floating Point Operation(Single Precisio  14465  14465  14465
          2  flop_count_sp_fma      Floating Point Operations(Single Precisi  20979  20979  20979
          2  flop_count_sp_special  Floating Point Operations(Single Precisi  87637  87637  87637
The "special" operations are listed in the arithmetic throughput table in the Programming Guide: reciprocal, reciprocal square root, log, exp, sin, and cos. Note that these are less precise (but faster) than the default versions; you have to opt in, using either the intrinsic or a compiler flag (-use_fast_math).
Despite what the documentation says, it seems the special operations are not included in the flop_count_sp total. That's a bug in the current version (8.0). I've filed a bug report, so it should be fixed in a future release (and this paragraph will be out of date at some point).
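The numbers in the question bear this out. As a quick sanity check (a minimal sketch in Python, with the values copied by hand from the nvprof output in the question; the variable names are only illustrative), the reported total matches the sum without the special operations exactly:

```python
# Per-invocation values copied from the nvprof output in the question.
sp_add = 14465      # flop_count_sp_add
sp_mul = 14465      # flop_count_sp_mul
sp_fma = 20979      # flop_count_sp_fma (each FMA counts as 2 flops)
sp_special = 87637  # flop_count_sp_special
reported = 70888    # flop_count_sp as printed by nvprof

# Sum as nvprof appears to compute it (special operations left out).
without_special = sp_add + sp_mul + 2 * sp_fma

# Sum as the documentation describes it (special operations included).
with_special = without_special + sp_special

print(without_special == reported)  # True
print(with_special)                 # 158525
```

So if you want the total as documented, adding flop_count_sp_special to flop_count_sp by hand gives 158525 for this kernel. This assumes the behavior observed in version 8.0; on a release where the bug is fixed, doing so would double-count the special operations.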