From Nvidia release notes:
The nvcc compiler switch, --fmad (short name: -fmad), to control the contraction of
floating-point multiplies and add/subtracts into floating-point multiply-add
operations (FMAD, FFMA, or DFMA) has been added:
--fmad=true and --fmad=false enables and disables the contraction respectively.
This switch is supported only when the --gpu-architecture option is set with
compute_20, sm_20, or higher. For other architecture classes, the contraction is
always enabled.
The --use_fast_math option implies --fmad=true, and enables the contraction.
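For reference, here is a minimal kernel (a made-up example, not one of my actual kernels) containing the kind of multiply-add expression the switch affects; compiling it both ways and dumping the SASS should show whether an FFMA is emitted:

__global__ void madd_example(const float *a, const float *b, const float *c,
                             float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // With --fmad=true this expression can be contracted into a single
        // FFMA (one rounding step); with --fmad=false it compiles to a
        // separate FMUL followed by FADD.
        out[i] = a[i] * b[i] + c[i];
    }
}

// Inspect the generated machine code, e.g.:
//   nvcc -arch=sm_20 -fmad=true -cubin madd_example.cu
//   cuobjdump --dump-sass madd_example.cubin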
I have two kernels: one is purely compute-bound with lots of multiplications, while the other is memory-bound. I notice a consistent performance improvement (around 5%) for the compute-intensive kernel when I compile with -fmad=false, and roughly the same percentage decline for the memory-bound kernel when I do the same.
So FMA contraction works better for my memory-bound kernel, but my compute-bound kernel can squeeze out a little extra performance by turning it off.
What could be the reason?
My device is M2090 and I am using CUDA 4.2.
Full compilation options:
-arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false -use_fast_math -fmad=false
(or I just drop -fmad=false, since -fmad=true is the default anyway).
Use of FMA may increase register pressure slightly, because three source operands must be available at the same time. So turning FMA generation on/off can lead to small differences in instruction scheduling and register allocation, which in turn can lead to small performance differences.
For a compute-bound kernel with many multiply-add idioms, -fmad=true should make a significant performance difference; but, as you say, your kernel is dominated by multiplies and thus will benefit little from the use of FMA, and any gains may be offset by the register-pressure / instruction-scheduling effects described above.
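If you want finer control than the global compiler switch gives you, one option (sketched below with a hypothetical kernel, not your code) is to keep -fmad=true and suppress contraction only in the multiply-heavy spots: the rounding-mode intrinsics __fmul_rn() and __fadd_rn() are never merged into an FMAD by the compiler, while __fmaf_rn() requests a fused multiply-add explicitly.

__global__ void mixed_fma(const float *a, const float *b, const float *c,
                          float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Kept as a separate FMUL + FADD even when --fmad=true is in effect:
        // the compiler does not contract the __fmul_rn/__fadd_rn intrinsics.
        float unfused = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]);

        // Explicit fused multiply-add (single rounding step), regardless of --fmad:
        float fused = __fmaf_rn(a[i], b[i], c[i]);

        out[i] = unfused + fused;
    }
}

That way the command-line flag can stay at whatever setting favors the memory-bound kernel, while the compute-bound code opts out where it helps.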