Is worth replacing all multiplications with the __umul24 function in a CUDA kernel? I read different and opposite opinions and I can't still make a bechmark to figure it out
Only in devices with architecture prior to fermi, that is with cuda capabilities prior to 2.0 where the integer arithmetic unit is 24 bit.
On Cuda Device with capabilities >= 2.0 the architecture is 32 bit the _umul24 will be slower instead of faster. The reason is because it has to emulate the 24 bit operation with 32 bit architecture.
The question is now: Is it worth the effort for the speed gain ? Probably not.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With