In the following kernel, I use an `if` statement to avoid out-of-range accesses. If I understand correctly, though, the `if` statement causes branch divergence, which slows down the computation. Please correct me if I am wrong here.
My question: how can I avoid the `if` statement while still handling the out-of-range case in the kernel?
__global__ void vector_add(float *a, float *b, float *c)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)
        c[index] = a[index]*a[index] + b[index]*b[index];
}
// kernel call here: (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK rounds the block count up
vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );
While it is technically called a "divergence" (because not all threads within a warp evaluate the condition identically), it is completely harmless.
The threads for which the predicate evaluates to false simply get disabled. That is not a performance issue, as those threads are not expected to take part in the computation anyway, so you are not losing any thread that would do real work. In the pathological case where N is congruent to 1 mod 32 (or whatever the warp size is), one warp is almost completely "wasted", but again, that is not a performance issue.
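To put numbers on this, here is a minimal host-side sketch. `GuardStats` and `guard_stats` are hypothetical names introduced only for illustration. It assumes a 1D launch whose block size is a multiple of the warp size, and it uses the usual round-up formula for the block count.

```cuda
// Hypothetical helper, not part of the original post: summarizes what an
// `index < n` guard does to a 1D launch. Assumes threads_per_block is a
// multiple of warp_size, so every warp covers warp_size consecutive indices.
struct GuardStats {
    int blocks;        // blocks launched (n / block size, rounded up)
    int idle_threads;  // launched threads with index >= n
    int mixed_warps;   // warps where the guard is true for only some lanes
    int idle_warps;    // warps where the guard is false for every lane
};

GuardStats guard_stats(int n, int threads_per_block, int warp_size = 32)
{
    GuardStats s;
    s.blocks = (n + threads_per_block - 1) / threads_per_block;
    int launched = s.blocks * threads_per_block;
    s.idle_threads = launched - n;

    // Only the warp that contains both index n-1 and index n can be split.
    s.mixed_warps = (n % warp_size != 0) ? 1 : 0;

    int total_warps     = launched / warp_size;
    int warps_with_work = (n + warp_size - 1) / warp_size;
    s.idle_warps = total_warps - warps_with_work;
    return s;
}
```

For `guard_stats(1025, 256)` this gives 5 blocks, 255 idle threads, one split warp (1 active lane, 31 masked) and 7 warps whose threads all fail the guard and exit immediately. However large N is, at most one warp is ever split by the bounds check.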
Warp divergence hurts your performance when threads within a warp take different paths that need to be executed serially. This is not the case here.
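As a rough illustration of the difference, the sketch below puts the guard pattern next to a kernel in which lanes of the same warp need different code. The names `guarded`, `divergent` and `guarded_matches_host` are hypothetical, the element count is passed as an argument instead of the question's `N`, and error handling is kept to a minimum. How much the second kernel actually costs depends on the generated code, since the compiler may use predication for short branches.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Same pattern as the kernel in the question. Lanes with i >= n are masked
// off, and there is no else-branch for the warp to execute afterwards.
__global__ void guarded(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] * a[i] + b[i] * b[i];
}

// Hypothetical counter-example: even and odd lanes of one warp take
// different branches, so the warp may run both bodies one after the other.
__global__ void divergent(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0)
            c[i] = sinf(a[i]) + b[i];
        else
            c[i] = cosf(a[i]) * b[i];
    }
}

// Runs `guarded` on n elements and checks every output against the host.
// Inputs are small integers, so the float results are exact on both sides.
bool guarded_matches_host(int n, int threads_per_block = 256)
{
    std::vector<float> a(n), b(n), c(n, 0.0f);
    for (int i = 0; i < n; ++i) { a[i] = float(i % 100); b[i] = 2.0f; }

    size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + threads_per_block - 1) / threads_per_block;
    guarded<<<blocks, threads_per_block>>>(d_a, d_b, d_c, n);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (c[i] != a[i] * a[i] + b[i] * b[i]) return false;
    return true;
}
```

Calling `guarded_matches_host(1025)` exercises the case from the answer where the last warp has a single active lane. The `divergent` kernel is only there for contrast and can be launched the same way.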