Why does CUDA code run much slower when -rdc=true is specified?

Tags: c++, cuda

I have many classes written as .h and .cu files, so I tried relocatable device code (-rdc=true). It took about 12 seconds. Then I tried combining the code: I switched to header-only classes and removed -rdc=true, and it took only 2 seconds.

The code computes sha1(some string) 0x40000 times, as is done in WinRAR encryption.

Why is that? It's OK for now, but my project will grow and separate compilation would be useful. Is it normal for -rdc=true to slow down performance?

asked Mar 12 '23 16:03 by aj3423


2 Answers

If the code of a function is located in a separate translation unit, that is, not in a header included by the file containing the entry point you are calling, then no inlining can occur, and each call to that function becomes a real (more expensive) function call. You might want to move your time-critical functions into a header file with the inline keyword so that the compiler has the opportunity to inline them.
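A minimal sketch of that advice, assuming SHA-1 helpers similar to what the question's hashing loop would use (the file and function names are hypothetical, not taken from the asker's code). The helpers live entirely in a header; under nvcc they are marked `__host__ __device__ __forceinline__`, so any kernel that includes the header can inline them, even when the project is built with -rdc=true, because the definitions sit in the caller's own translation unit.

```cpp
// sha1_ops.h: hypothetical header-only helpers (names are illustrative).
// In a real header, add an include guard or #pragma once.
#include <cassert>
#include <cstdint>

// Under nvcc, compile the helpers for host and device and request inlining;
// under a plain C++ compiler, fall back to ordinary inline functions.
#ifdef __CUDACC__
#define SHA1_INLINE __host__ __device__ __forceinline__
#else
#define SHA1_INLINE inline
#endif

// Left-rotate a 32-bit word. n must be in 1..31 (x >> 32 would be undefined).
SHA1_INLINE std::uint32_t rotl32(std::uint32_t x, unsigned n) {
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

// SHA-1 round function f_t(b, c, d) for round index t in 0..79.
SHA1_INLINE std::uint32_t sha1_f(unsigned t, std::uint32_t b,
                                 std::uint32_t c, std::uint32_t d) {
    if (t < 20) return (b & c) | (~b & d);           // Ch
    if (t < 40) return b ^ c ^ d;                    // Parity
    if (t < 60) return (b & c) | (b & d) | (c & d);  // Maj
    return b ^ c ^ d;                                // Parity
}

// SHA-1 round constant K_t for round index t in 0..79.
SHA1_INLINE std::uint32_t sha1_k(unsigned t) {
    if (t < 20) return 0x5A827999u;
    if (t < 40) return 0x6ED9EBA1u;
    if (t < 60) return 0x8F1BBCDCu;
    return 0xCA62C1D6u;
}
```

A kernel in any .cu file that includes this header can call `rotl32`, `sha1_f` and `sha1_k` inside its 80-round loop without paying for a cross-file call.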

Separate compilation may also force parameters to be passed through the local address space (see http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#abstracting-abi for parameter passing), which is much more expensive than passing them in registers, as this table shows (http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#operand-costs).
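One way to check whether this is happening, assuming a hypothetical source file `sha1.cu` and an sm_80 target (adjust both to your build): ask ptxas for its resource report. With `-Xptxas -v` it prints, per kernel and per non-inlined device function, the registers used, the stack frame size and any spill stores/loads; a nonzero stack frame on the called functions suggests the ABI is using local memory.

```shell
# Hypothetical file name and architecture; substitute your own.
# -rdc=true with -c compiles relocatable device code (same as -dc).
# -Xptxas -v forwards -v to ptxas, which reports registers, stack frame and spills.
nvcc -arch=sm_80 -rdc=true -Xptxas -v -c sha1.cu -o sha1.o
```

Running the same command on the header-only version and comparing the two reports makes the difference visible without timing anything.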

A possible solution is to move some methods from your class implementation file into the header file, marked inline to avoid multiple-definition errors at link time.
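The same idea applied to a class, as a hedged sketch: every member function is defined inside the class body in the header, which makes it implicitly inline, so several .cu files can include the header without multiple-definition link errors and each caller can inline the code. `Sha1OneBlock` is a hypothetical name, and the sketch only hashes messages of up to 55 bytes (one padded 64-byte block); longer inputs would need a multi-block update loop.

```cpp
// sha1_block.h: hypothetical header-only class. Member functions defined in
// the class body are implicitly inline and visible to every including file.
#include <cassert>
#include <cstddef>
#include <cstdint>

#ifdef __CUDACC__
#define SHA1_HD __host__ __device__
#else
#define SHA1_HD
#endif

class Sha1OneBlock {
public:
    // Hashes a message of at most 55 bytes (it must fit in one padded block).
    SHA1_HD void hash(const std::uint8_t* msg, std::size_t len,
                      std::uint32_t out[5]) const {
        assert(len <= 55);
        std::uint32_t w[80];
        // Build the padded block as 16 big-endian 32-bit words.
        for (int i = 0; i < 16; ++i) w[i] = 0;
        for (std::size_t i = 0; i < len; ++i)
            w[i / 4] |= std::uint32_t(msg[i]) << (24 - 8 * (i % 4));
        w[len / 4] |= 0x80u << (24 - 8 * (len % 4));  // the 0x80 padding byte
        w[15] = std::uint32_t(len * 8);               // length in bits (w[14] stays 0)
        // Expand the message schedule.
        for (int t = 16; t < 80; ++t)
            w[t] = rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1);

        std::uint32_t a = 0x67452301u, b = 0xEFCDAB89u, c = 0x98BADCFEu,
                      d = 0x10325476u, e = 0xC3D2E1F0u;
        for (int t = 0; t < 80; ++t) {
            std::uint32_t f, k;
            if (t < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999u; }
            else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
            else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
            else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }
            std::uint32_t tmp = rotl(a, 5) + f + e + k + w[t];
            e = d; d = c; c = rotl(b, 30); b = a; a = tmp;
        }
        // Add the compressed block to the initial hash value.
        out[0] = 0x67452301u + a; out[1] = 0xEFCDAB89u + b;
        out[2] = 0x98BADCFEu + c; out[3] = 0x10325476u + d;
        out[4] = 0xC3D2E1F0u + e;
    }

private:
    // Left-rotate; only called with n = 1, 5 or 30, so the shifts are defined.
    static SHA1_HD std::uint32_t rotl(std::uint32_t x, unsigned n) {
        return (x << n) | (x >> (32 - n));
    }
};
```

Because nothing is defined in a separate .cu file, this class behaves the same whether or not -rdc=true is used for the rest of the project.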

answered Mar 24 '23 14:03 by Florent DUGUET


Separate compilation could well be the cause of this slowdown. When each file is compiled on its own, the compiler does not have enough information to apply certain optimizations, such as inlining a function defined in another file, because that information is only available at link time. Apparently nvcc still does not perform those optimizations at the link stage.
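For readers on newer toolkits: starting with CUDA 11.2, nvcc supports device link-time optimization (`-dlto`), which targets exactly this missing link-stage step, including inlining across files. A sketch of the three builds, assuming hypothetical files `main.cu` and `sha1.cu` and an sm_80 target:

```shell
# Hypothetical file names and architecture; adjust to your project.

# 1) Whole-program build: all device code reachable from one translation unit.
nvcc -arch=sm_80 -o app main.cu

# 2) Separate compilation: calls into sha1.cu are real, non-inlined calls.
nvcc -arch=sm_80 -rdc=true -o app main.cu sha1.cu

# 3) Separate compilation with device LTO (CUDA 11.2 or later):
#    compile to LTO intermediates, then optimize again at device link time.
nvcc -gencode arch=compute_80,code=lto_80 -dc main.cu sha1.cu
nvcc -arch=sm_80 -dlto -o app main.o sha1.o
```

Timing builds 1 and 3 against each other is a direct way to see how much of the -rdc=true slowdown the link-time optimizer recovers for a given kernel.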

answered Mar 24 '23 16:03 by Davide Spataro