Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

cuda control divergence

Tags:

cuda

say I have 3 share memory array: a b c. I am not sure if following thread arrangement will cause control divergence or not,

if (threadIdx < 64)
{
    if (threadIdx == 1)
        for (int i = 0; i < N; i++)
            c += a[threadIdx]*a[threadIdx];
    else
        for (int i = 0; i < N; i++)
            c += a[threadIdx]*b[threadIdx];
}

if it does, how bad is it gonna affect performance? is there any efficient way to handle the problem? thanks

like image 927
small_potato Avatar asked Dec 01 '22 04:12

small_potato


1 Answers

Depending on the dimensions of your block the first condition threadIdx.x < 64 (note the .x) may not cause any divergence at all. For example, if you have a block with dimensions (128,1,1) then the first two warps (32-threads groups which execute in lock-step) will enter into the if block while the last two will bypass it. Since the whole warp goes one way or the other there is no divergence.

A conditional like threadIdx.x == 1 will cause divergence, but it will have very modest cost. Indeed, in many cases CUDA will be able to implement the conditional expression with a single instruction. For instance, operations like min, max, and abs will generally be implemented with a single instruction and cause absolutely no divergence. You can read about such instructions in the PTX Manual.

In general you should not be overly concerned about modest amounts of control-flow divergence like the above. Where divergence will bite you in in situations like

if (threadIdx.x % 4 == 0)
  // do expensive operation
else if (threadIdx.x % 4 == 1)
  // do expensive operation
else if (threadIdx.x % 4 == 2)
  // do expensive operation
else
  // do expensive operation

where an "expensive operation" would be one that required 10s or 100s of instructions. In this case the divergence caused by the if statements would reduce efficiency by 75%.

Keep in mind that thread divergence is a much lesser concern than (1) high-level algorithm choices and (2) memory locality/coalescing. Very few CUDA programmers should ever be concerned with the sort of divergence in your examples.

like image 141
wnbell Avatar answered Dec 05 '22 08:12

wnbell