My CUDA program uses only float, int, short, and char types in its computation. None of the input or output arrays have members of type double, and none of the kernels create any double values internally during computation.
This program has been compiled using CUDA SDK 5.5 in Release mode using NSight Eclipse. A typical compile line looks like this:
nvcc -O3 -gencode arch=compute_35,code=sm_35 -M -o "src/foo.d" "../src/foo.cu"
I am running this program on a GTX Titan on Linux. To my surprise, I noticed that this program runs 10% faster when I enable the full speed FP64 mode on Titan. This can be done by enabling CUDA Double Precision option in NVIDIA X Server Settings program.
While I am happy with this free speed bonus, I would like to understand why a float-only CUDA program could run faster in FP64 mode.
My guess is that when you enable full-speed FP64 mode on Titan, the dedicated FP64 units also start participating in the computation, and these FP64 units can execute FP32 work as well. However, enabling the FP64 blocks at full rate also lowers the clock speed, so the net gain is only about 10%.
Where does the 10% come from? When Titan runs in 1/24 FP64 mode, it clocks at 837 MHz; in 1/3 FP64 mode, it clocks at 725 MHz. So (1 + 1/3) / (1 + 1/24) * 725/837 = 1.109.
References: http://www.anandtech.com/show/6760/nvidias-geforce-gtx-titan-part-1/4
I found confirmation of my guess:
"What's more, the CUDA FP64 block has a very special execution rate: 1/1 FP32."
Reference http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2
That information is for GK104, while Titan has GK110, but they share the same Kepler architecture, so I think GK110's FP64 units have the same capability.