Normally it is advised to lower per-thread register pressure to increase warp occupancy, thereby providing greater opportunity to hide latency through warp-level multithreading (TLP). To decrease register pressure, one would move data into per-thread local memory or per-thread-block shared memory. The CUDA nvcc compiler can also be forced to use fewer registers per thread. This approach is useful for workloads with high arithmetic intensity, i.e. where the ratio of ALU operations to memory read/write requests is high. However, for latency-critical applications where there is very little computation and memory accesses are frequent, this approach tends to actually lower performance.
For such latency-critical applications, it makes more sense to bring as much data as possible into the on-chip registers or shared memory, and then reuse it as much as possible before replacing it with the next chunk of data from global memory. Of course, increasing register pressure decreases warp occupancy, but now we are hiding off-chip memory latency using fast on-chip registers. The way to increase per-thread register usage is to increase ILP, either by unrolling loops or by calculating more output data per thread (which also increases ILP, essentially by doing the same work on more inputs). This approach was suggested by Volkov (Better Performance at Lower Occupancy).
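For illustration, here is a minimal sketch of the "more output data per thread" idea on a toy scaling kernel; the kernel and every name in it are made up for this example, not taken from my actual code:

    // Each thread computes two output elements instead of one. The two
    // loads are independent, so they can be in flight at the same time
    // (ILP), at the cost of holding two values in registers at once.
    // Launch with roughly (n + 1) / 2 threads in total.
    __global__ void scale2(const float *in, float *out, int n, float a)
    {
        int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
        if (i + 1 < n) {
            float r0 = in[i];       // independent load #1
            float r1 = in[i + 1];   // independent load #2
            out[i]     = a * r0;
            out[i + 1] = a * r1;
        } else if (i < n) {
            out[i] = a * in[i];     // tail element when n is odd
        }
    }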
Now, the nvcc compiler driver has a command-line option called maxrregcount which allows one to change the per-thread register usage. With this option one can force the compiler to decrease per-thread register usage, but one cannot force it to increase it. I have a case where I want to increase per-thread register usage, but I cannot unroll the loops inside my kernel because the loop bounds are data-dependent and dynamic. So far I have tried a few tricks, but I have run out of ideas on how to increase per-thread register usage. Can anyone suggest ways to increase the register usage of a single CUDA thread?
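For reference, the flag is passed like this (app and kernel.cu are placeholder names); note that it only caps register usage, it never raises it:

    nvcc --maxrregcount=32 -o app kernel.cu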
To some extent, this question duplicates Forcing CUDA to use register for a variable. You have summarized the options pretty well. If you can't force register usage via unrolling and explicit scalar variable usage, then I think you may be stuck.
Note that even loops with dynamic bounds can be partially hand-unrolled. You just have to check the bounds within the unrolled parts of the loop. This may help increase register usage.
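As a hedged sketch of what partial hand-unrolling with bound checks might look like for a loop whose bound n is only known at run time (the kernel and all names are illustrative, not taken from the question):

    // Grid-stride loop unrolled by 4. The four loads per iteration are
    // independent, so the compiler keeps them in separate registers and
    // can overlap their latencies; the bound checks live inside the
    // unrolled body, so the dynamic bound n is still respected.
    __global__ void scale4(const float *in, float *out, int n, float a)
    {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;

        for (int base = tid; base < n; base += 4 * stride) {
            int i0 = base;
            int i1 = base + stride;
            int i2 = base + 2 * stride;
            int i3 = base + 3 * stride;

            float r0 = (i0 < n) ? in[i0] : 0.0f;  // bound checks inside
            float r1 = (i1 < n) ? in[i1] : 0.0f;  // the unrolled body
            float r2 = (i2 < n) ? in[i2] : 0.0f;
            float r3 = (i3 < n) ? in[i3] : 0.0f;

            if (i0 < n) out[i0] = a * r0;
            if (i1 < n) out[i1] = a * r1;
            if (i2 < n) out[i2] = a * r2;
            if (i3 < n) out[i3] = a * r3;
        }
    }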
I also think that there is not a guaranteed direct relationship between increasing register usage and decreasing latency, so really you should focus on decreasing latency, not particularly on register usage.
If you want to decrease overall kernel latency, then there are some things you should try.
Interesting problem! I'm trying this method of using ILP to get better performance too! In fact, because I'm constrained by an older GPU architecture with fewer registers allocated per thread, using ILP actually improves performance, as loop unrolling (independent instructions) frees up the registers for more computational work!
I wonder how many nested loops you have? If the inner loop cannot be unrolled, perhaps go up a level and look for opportunities there?
To increase register usage per thread, have you tried reducing the number of blocks launched (with fewer threads overall)?
To increase register usage per thread, load more than one set of data and process them in parallel.
Is each iteration of the loop independent? I believe the key is to look for independent computations. How about performing them in batches: say the loop count is N, split it into N/M batches of M and compute them independently (see the sketch below)?
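A rough sketch of that batching idea on a toy dot-product loop, using M = 2 independent accumulators; every name here is illustrative, not from the thread:

    // Two accumulators instead of one: the two multiply-adds in each
    // iteration do not depend on each other, so the hardware can
    // overlap their latencies. *out must be zeroed before the launch.
    __global__ void dot_batched(const float *x, const float *y,
                                float *out, int n)
    {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;

        float acc0 = 0.0f, acc1 = 0.0f;
        int j = tid;
        for (; j + stride < n; j += 2 * stride) {
            acc0 += x[j] * y[j];                    // batch 0
            acc1 += x[j + stride] * y[j + stride];  // batch 1
        }
        if (j < n)
            acc0 += x[j] * y[j];                    // leftover element

        atomicAdd(out, acc0 + acc1);  // crude combine, kept short here
    }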
It's hard to give suggestions when you give so few clues :P