I understand that branching in CUDA is not recommended, as it can adversely affect performance. In my work, I find myself having to implement large switch statements that contain upward of a few dozen cases.
Does anyone have any idea how badly this will affect performance? (The official documentation isn't very specific.) Also, does anyone have a more efficient way of handling this portion?
IMO switch statements are not bad, but should be avoided if possible. One solution would be to use a Map where the keys are the commands and the values are Command objects with an execute() method, or a List if your commands are numeric and have no gaps.
Write simple and small kernels. Kernel launch cost is negligible (~5 µs). If you have one large kernel, try to split it up into multiple small ones; it might be faster due to fewer registers being used. Small kernels leave more resources available (registers, shared memory, constant memory, etc.).
A switch statement can be significantly faster than a long if-else ladder, because the compiler may build a jump table for the switch at compile time. Instead of checking each condition in turn at run time, the code jumps directly to the matching case.
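As a minimal sketch of what such a dispatch kernel might look like (the opcode values and operations here are purely illustrative), a dense switch like this is a candidate for a compiler-generated jump table, though warp divergence still depends on how opcodes mix within each warp:

```cuda
// Each thread dispatches on its own opcode; a dense case range
// lets the compiler lower the switch to a jump table.
__global__ void dispatch(const int* opcode, float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    switch (opcode[i]) {
        case 0: x[i] += 1.0f; break;
        case 1: x[i] *= 2.0f; break;
        case 2: x[i] = -x[i]; break;
        default: break;  // unknown opcode: leave value unchanged
    }
}
```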
__global__ : a qualifier added to standard C. This alerts the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU).
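For illustration, a minimal kernel using the qualifier (the kernel name and launch configuration are just examples):

```cuda
// __global__ marks a kernel: compiled for the GPU, launched from
// the host with the <<<grid, block>>> execution configuration.
__global__ void add_one(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Host-side launch (sketch):
//   add_one<<<(n + 255) / 256, 256>>>(d_data, n);
```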
The GPU runs threads in groups of 32, called warps. Whenever different threads in a warp go through different paths in the code, the GPU has to run the entire warp multiple times, once for each code path.
To deal with this issue, called warp divergence, you want to arrange your threads so that the threads in a given warp go through as few different code paths as possible. When you have done that, you pretty much just have to bite the bullet and accept the loss in performance caused by any remaining warp divergence. In some cases, there might not be anything you can do to arrange your threads. If so, and if the different code paths are a big part of your kernel or overall workload, the task may not be a good fit for the GPU.
It doesn't matter how you implement the different code paths: if-else, switch, predication (in PTX or SASS), branch tables, or anything else. If it comes down to the threads in a warp running different paths, you get a hit on performance.
It also doesn't matter how many threads go through each path, just the total number of different paths in the warp.
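The thread-arrangement advice above can be sketched as follows. In this hypothetical setup, work items have been partitioned by case on the host so that each block handles a single opcode; since the opcode is uniform across the block, every warp takes one path:

```cuda
// block_op[blockIdx.x] holds the single opcode assigned to this
// block (work was grouped by case on the host beforehand).
__global__ void uniform_blocks(const int* block_op, float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int op = block_op[blockIdx.x];   // uniform across the warp
    if (op == 0)      x[i] += 1.0f;  // whole warp takes one path
    else if (op == 1) x[i] *= 2.0f;  // no intra-warp divergence
}
```

The branches still exist, but because every thread in a warp reads the same `op`, the warp executes only one of the paths.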
Here is another answer on this that goes into a bit more detail.
A good way to avoid a large switch is to implement a function table and select the function from the table by an index derived from your switch condition. CUDA allows you to use pointers to __device__ functions in kernels.
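A minimal sketch of that approach, with made-up operation names. Note that the address of a __device__ function is only meaningful in device code, so the table itself lives in device memory:

```cuda
// Illustrative element-wise operations.
__device__ float op_add(float a, float b) { return a + b; }
__device__ float op_mul(float a, float b) { return a * b; }
__device__ float op_sub(float a, float b) { return a - b; }

typedef float (*binop_t)(float, float);

// Device-side table of __device__ function pointers.
__device__ binop_t op_table[3] = { op_add, op_mul, op_sub };

__global__ void apply_op(const float* a, const float* b, float* out,
                         const int* opcode, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Indexing replaces the switch; the warp still diverges if
        // opcodes differ between threads in the same warp.
        out[i] = op_table[opcode[i]](a[i], b[i]);
    }
}
```

This removes the big switch from the source, but as noted above it does not by itself remove divergence: that depends on how the opcodes are distributed across each warp.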