I have many classes written as .h and .cu pairs, so I tried relocatable device code (-rdc=true). It took about 12 seconds. Then I combined the code into header-only classes and removed -rdc=true, and it took only 2 seconds.
What the code does is compute sha1(some string) 0x40000 times, as used in WinRAR encryption.
Why is that? It's OK for now, but my project will grow and separate compilation would be useful. Is it normal for -rdc=true to slow down performance?
If the code of a function lives in a separate translation unit, that is, not in a header included by the entry point that calls it, then no inlining can occur. In that case each function call is more expensive. You might want to move your time-critical functions into a header file and mark them with the inline keyword so that the compiler has the opportunity to inline them.
Separate compilation might lead to the use of the local address space for parameters (see http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#abstracting-abi for parameter passing), which is much more expensive than registers, as this table shows: http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#operand-costs.
Moving some methods from your class implementation file into the header file, marked with the inline keyword to avoid multiple-definition errors at link time, might be a solution.
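As a minimal sketch of that suggestion (not the asker's actual code), a header could hold the small, hot helpers of a SHA-1 loop so that every translation unit that includes it sees their bodies. The names `HOST_DEVICE`, `rotl32` and `sha1_ch` are made up for illustration; the `__CUDACC__` guard is only there so the same header also compiles with a plain C++ compiler.

```cpp
// sha1_helpers.h (hypothetical file name): header-only helpers.
#include <cstdint>

// nvcc defines __CUDACC__; there the functions are compiled for both host
// and device. With an ordinary C++ compiler the qualifiers expand to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// 32-bit left rotation. Assumes 0 < n < 32, which holds for the fixed
// rotation amounts used in SHA-1 (1, 5 and 30).
HOST_DEVICE inline std::uint32_t rotl32(std::uint32_t x, unsigned n) {
    return (x << n) | (x >> (32u - n));
}

// SHA-1 "choose" function: picks bits from c where b is 1, from d where b is 0.
HOST_DEVICE inline std::uint32_t sha1_ch(std::uint32_t b, std::uint32_t c,
                                         std::uint32_t d) {
    return (b & c) | (~b & d);
}
```

Because the definitions are visible wherever the header is included, a kernel in any .cu file can call them without a cross-file device call, and `inline` keeps the linker from complaining about duplicate definitions. Whether the compiler actually inlines them is still its decision.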
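It is quite possible that separate compilation causes this slowdown. When each translation unit is compiled on its own, the compiler may not have enough information to apply certain optimizations, because everything that is only known at link time is missing. At the time of writing, nvcc apparently did not perform those optimizations at the link stage. (Newer toolkits, CUDA 11.2 and later, add device link-time optimization through the -dlto option, which is meant to address this.)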