
Dealing with large switch statements in CUDA

I understand that branching in CUDA is not recommended as it can adversely affect performance. In my work, I find myself having to implement large switch statements that contain upward of a few dozen cases.

Does anyone have any idea how badly this will affect performance? (The official documentation isn't very specific.) Also, does anyone have a more efficient way of handling this portion?

asked Jun 25 '12 by gamerx

People also ask

Should switch statements be avoided?

IMO switch statements are not bad, but should be avoided if possible. One solution would be to use a Map where the keys are the commands, and the values Command objects with an execute() method. Or a List if your commands are numeric and have no gaps.

How can I make my CUDA code faster?

Write simple and small kernels. Kernel launch cost is negligible (~5 µs). If you have one large kernel, try to split it into multiple small ones; it might be faster due to fewer registers used. With small kernels we get more resources (registers, shared memory, constant memory, etc.) per thread.

Are switch case statements faster?

A switch statement is significantly faster than an if-else ladder if there are many nested if-elses involved. This is due to the creation of a jump table for the switch during compilation. As a result, instead of checking which condition is satisfied during execution, it just decides which case to execute.

What is function of __ global __ qualifier in CUDA program?

__global__ : a qualifier added to standard C. It alerts the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU).
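As a minimal sketch (kernel name and launch configuration are illustrative, not from the question):

```cuda
// __global__ marks a kernel: it runs on the GPU but is launched from host code.
__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Host side: launched with the <<<blocks, threads>>> syntax, e.g.
// add_one<<<(n + 255) / 256, 256>>>(d_data, n);
```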


2 Answers

The GPU runs threads in groups of 32, called warps. Whenever different threads in a warp go through different paths in the code, the GPU has to run the entire warp multiple times, once for each code path.

To deal with this issue, called warp divergence, you want to arrange your threads so that the threads in a given warp go through as few different code paths as possible. When you have done that, you pretty much just have to bite the bullet and accept the loss in performance caused by any remaining warp divergence. In some cases, there might not be anything you can do to arrange your threads. If so, and if the different code paths are a big part of your kernel or overall workload, the task may not be a good fit for the GPU.
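The divergence described above can be sketched with a hypothetical kernel; the op codes and arithmetic here are invented purely to show the pattern:

```cuda
// Sketch: if op[i] varies between threads of the same 32-thread warp,
// the warp serializes, executing once per distinct case taken.
__global__ void divergent(const int* op, float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    switch (op[i]) {            // per-thread case selection -> warp divergence
        case 0:  x[i] *= 2.0f; break;
        case 1:  x[i] += 1.0f; break;
        default: x[i]  = 0.0f; break;
    }
}
```

If the data is instead arranged (e.g. sorted by op) so that each warp sees a single op value, every warp executes just one case and the divergence cost disappears.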

It doesn't matter how you implement the different code paths: if-else, switch, predication (in PTX or SASS), branch tables, or anything else. If it comes down to the threads in a warp running different paths, you take a hit on performance.

It also doesn't matter how many threads go through each path, just the total number of different paths in the warp.

Here is another answer on this that goes into a bit more detail.

answered Sep 27 '22 by Roger Dahl


A good way to avoid a large switch is to implement a function table and select the function from the table by an index derived from your switch condition. CUDA allows you to use function pointers to __device__ functions inside kernels.
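A sketch of that table, assuming a fixed set of unary float operations (the function names and table size are hypothetical; device function addresses must be taken in device-compiled code, as in the static initializer here):

```cuda
typedef float (*op_t)(float);

__device__ float op_double(float v) { return 2.0f * v; }
__device__ float op_square(float v) { return v * v; }
__device__ float op_negate(float v) { return -v; }

// Table lives in device memory, initialized with device function addresses.
__device__ op_t op_table[3] = { op_double, op_square, op_negate };

__global__ void apply_op(const int* idx, float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = op_table[idx[i]](x[i]);  // indexed call replaces the switch
}
```

Note that, as the first answer points out, the indexed call still diverges if threads within a warp select different table entries; the table mainly cleans up the code and scales better than dozens of cases.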

answered Sep 27 '22 by geek