
How to check boundary of array in CUDA Kernel without branch divergence

Tags:

cuda

In the following kernel, I used an `if` statement to avoid out-of-range accesses. But if I understand correctly, the `if` statement will cause branch divergence that slows down the computation. Please correct me if I am wrong here.

My question: How can I avoid the `if` statement while still handling out-of-range indices in the kernel?

__global__ void vector_add(float *a, float *b, float *c)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index < N)
          c[index] = a[index]*a[index] + b[index]*b[index];
}
//kernel call here
vector_add<<< (N + (THREADS_PER_BLOCK - 1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );
Tae-Sung Shin asked Mar 18 '23 01:03



1 Answer

While it is technically called a "divergence" (because not all threads within a warp evaluate the condition identically), it is completely harmless.

The threads for which the predicate is false simply get disabled. This is not a performance issue, because those threads are not expected to take part in the computation anyway: you are not losing any thread that would otherwise do useful work. In the pathological case where N is congruent to 1 mod 32 (or whatever the warp size is), one warp is almost completely "wasted", with a single active thread, but again, that is not a performance issue.
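To put rough numbers on the pathological case above, here is a minimal host-side sketch. `WarpStats` and `guard_stats` are hypothetical names introduced for illustration, and the sketch assumes a 1D launch, a warp size of 32, and a block size that is a multiple of the warp size. It counts how many warps the guard `if (index < n)` actually splits.

```cuda
#include <cassert>
#include <cstdio>

// Hypothetical helper types and names, not part of the original post.
struct WarpStats {
    long long total_warps;    // warps launched across the whole grid
    long long partial_warps;  // warps in which the guard splits the threads
    long long idle_warps;     // warps in which every thread fails the guard
};

// Assumes a 1D grid and threads_per_block being a multiple of warp_size,
// so that warp boundaries line up with multiples of warp_size in `index`.
WarpStats guard_stats(long long n, int threads_per_block, int warp_size = 32)
{
    long long blocks  = (n + threads_per_block - 1) / threads_per_block;
    long long total   = blocks * (threads_per_block / warp_size);
    long long full    = n / warp_size;                  // every thread passes
    long long partial = (n % warp_size != 0) ? 1 : 0;   // the warp containing n
    return { total, partial, total - full - partial };
}

int main()
{
    // n is congruent to 1 mod 32: the case described in the answer.
    WarpStats s = guard_stats(1000001, 256);
    assert(s.partial_warps <= 1);  // at most one warp is ever split by the guard
    std::printf("total=%lld partial=%lld idle=%lld\n",
                s.total_warps, s.partial_warps, s.idle_warps);
    return 0;
}
```

Under these assumptions, at most one warp in the whole grid contains both in-range and out-of-range threads, no matter how large `n` is.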

Warp divergence hurts your performance when threads within a warp take different paths that need to be executed serially. This is not the case here.
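As a rough illustration of the difference, the sketch below puts the guarded kernel from the question next to a kernel that really does diverge. The names `guarded`, `divergent`, and `check_guarded` are made up for this example, `n` is passed as a parameter instead of the question's `N` macro, and unified memory is used only to keep the host code short. Whether the second kernel is measurably slower depends on the compiler and hardware (short branches may be predicated), so treat it as a shape to profile rather than a benchmark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Guard only: out-of-range threads skip the body. There is no second code
// path, so nothing has to be executed serially for them.
__global__ void guarded(const float *a, const float *b, float *c, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)
        c[index] = a[index] * a[index] + b[index] * b[index];
}

// Illustrative divergence (hypothetical kernel): even and odd threads of the
// same warp take different paths, which the hardware may run one after the other.
__global__ void divergent(const float *a, const float *b, float *c, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) return;
    if (threadIdx.x % 2 == 0)
        c[index] = sqrtf(a[index] * a[index] + b[index] * b[index]);
    else
        c[index] = expf(a[index]) - b[index];
}

// Hypothetical host helper: runs `guarded` on n elements and checks the result.
bool check_guarded(int n)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up
    const size_t bytes = n * sizeof(float);

    float *a = nullptr, *b = nullptr, *c = nullptr;
    bool ok = cudaMallocManaged(&a, bytes) == cudaSuccess
           && cudaMallocManaged(&b, bytes) == cudaSuccess
           && cudaMallocManaged(&c, bytes) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }
        guarded<<<blocks, threads>>>(a, b, c, n);
        ok = cudaDeviceSynchronize() == cudaSuccess;
        for (int i = 0; ok && i < n; ++i)
            ok = (c[i] == 5.0f);  // 1*1 + 2*2
    }
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return ok;
}

int main()
{
    // 1000 is deliberately not a multiple of 32, so the last warp is partial.
    std::printf("guarded kernel %s\n", check_guarded(1000) ? "passed" : "failed");
    return 0;
}
```

`divergent` can be launched the same way and compared in a profiler. The point of the contrast is that `guarded` has only one path to execute, while `divergent` has two that every warp must cover.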

user703016 answered Apr 26 '23 08:04
