I had a quick glance at the CUDA Programming Guide with respect to the -use_fast_math optimizations, and although Appendix C mentions that divisions are converted to an intrinsic, there is no mention of multiplications. The reason I ask is that my kernel has a lot of multiplications. I am aware that NVCC will try to fuse multiplications and additions into FMAD operations (when the regular '*' and '+' operators are used) and that intrinsics are never merged into FMAD operations. But if my code is multiplication heavy, would there be a benefit in using a single-precision rounding intrinsic like __fmul_rn?
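For concreteness, here is a hypothetical sketch of the kind of multiplication-heavy kernel I mean (the names and the arithmetic are made up):

    __global__ void scale3(const float *x, const float *y, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = x[i] * x[i] * y[i] + y[i];  // the trailing * and + may be fused into an FMAD
    }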
So there are two questions:
Does the -use_fast_math option translate multiplications written with the '*' operator into SP intrinsics like __fmul_rn?
Could there be a performance benefit in hand-coding multiplications to explicitly use __fmul_rn? An example or some numbers would help me understand.
"Standalone" single precision multiplication always compiles to hardware instructions ("intrinsics"). There is no other type of floating point multiplication instructions. The -use_fast_math option in nvcc has no effect on the floating point multiplication instructions emitted for compute capability 1.x targets. On compute 2.x and 3.x targets, it puts the compiler into a compatibility mode and all single precision multiplication instructions will be mul.ftz.f32
(flush to zero).
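One way to see this for yourself is to compile a trivial kernel both ways and compare the PTX (a sketch; the file names are mine):

    // mul.cu -- minimal kernel for inspecting the emitted multiply instruction
    __global__ void mul(const float *a, const float *b, float *c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] * b[i];
    }

    // Compare the PTX for a compute 2.x target:
    //   nvcc -arch=sm_20 -ptx mul.cu -o mul.ptx
    //   nvcc -arch=sm_20 -use_fast_math -ptx mul.cu -o mul_fast.ptx
    // With -use_fast_math the multiply should carry the .ftz
    // (flush-to-zero) modifier; the default build rounds to nearest
    // without flushing denormals.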
The floating-point intrinsics you mention (__fmul_{rn,rz,rd,ru}) only provide explicit control over the IEEE rounding behaviour. I don't believe there is a throughput difference between any of them on Fermi or Kepler GPUs.
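I don't have numbers to offer, but here is a minimal sketch of what the intrinsics actually buy you (rounding-mode control, at the cost of FMAD contraction):

    __global__ void rounding_demo(float a, float b, float *out)
    {
        // The same hardware multiply, with explicit IEEE-754 rounding modes:
        out[0] = __fmul_rn(a, b);  // round to nearest even (what '*' gives you)
        out[1] = __fmul_rz(a, b);  // round towards zero
        out[2] = __fmul_rd(a, b);  // round towards negative infinity
        out[3] = __fmul_ru(a, b);  // round towards positive infinity

        // Because these are intrinsics, the compiler will never contract
        // them with a neighbouring add into an FMAD, whereas a plain
        // a * b + c may be fused.
    }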