 

Does the -use_fast_math option translate SP multiplications to intrinsics?

I had a quick glance at the CUDA Programming Guide regarding -use_fast_math optimizations. Although Appendix C mentions that divisions are converted to an intrinsic, there is no mention of multiplications. The reason I ask is that my kernel has a lot of multiplications. I am aware that NVCC will try to fuse multiplications and additions (when the regular '*' and '+' operators are used), and that intrinsics are never merged into FMAD operations. But if my code is multiplication heavy, would there be a benefit in using a correctly rounded SP intrinsic like __fmul_rn?

So there are two questions:

  1. Does the -use_fast_math option translate multiplications with the '*' operator to SP intrinsics like __fmul_rn?

  2. Could there be a performance benefit in hand-coding multiplications to explicitly use __fmul_rn? An example or some numbers would help me understand.

asked Oct 22 '22 by Sayan

1 Answer

"Standalone" single precision multiplication always compiles to hardware instructions ("intrinsics"). There is no other type of floating point multiplication instructions. The -use_fast_math option in nvcc has no effect on the floating point multiplication instructions emitted for compute capability 1.x targets. On compute 2.x and 3.x targets, it puts the compiler into a compatibility mode and all single precision multiplication instructions will be mul.ftz.f32 (flush to zero).

The floating point intrinsics you mention (__fmul_{rn,rz,ru,rd}) only provide explicit control over the IEEE rounding behaviour. I don't believe there is a throughput difference between any of them on Fermi or Kepler GPUs.
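To make the difference concrete, here is a minimal sketch (not part of the original answer; kernel names and constants are made up for illustration) contrasting the '*' operator with __fmul_rn. The key behavioural difference is not speed but contraction: the plain form may be fused into an FMA, while the intrinsic keeps the multiply as a separate, correctly rounded instruction.

```cuda
// Plain '*' multiplication: the compiler is free to contract
// a[i] * b[i] + 1.0f into a single FMA instruction.
__global__ void mul_plain(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] * b[i] + 1.0f;   // eligible for FMA contraction
}

// Explicit round-to-nearest intrinsic: never merged into an FMA,
// so the multiply stays a distinct, IEEE round-to-nearest instruction.
__global__ void mul_intrinsic(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __fmul_rn(a[i], b[i]) + 1.0f;  // separate mul, then add
}
```

Compiling both with `nvcc -ptx` (with and without -use_fast_math) and diffing the emitted PTX should show the first kernel using a fused multiply-add while the second keeps a distinct multiply. Per the answer above, there should be no throughput difference between the two multiplies themselves; hand-coding __fmul_rn buys rounding control (and suppresses fusion), not speed.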

answered Oct 30 '22 by talonmies