
Divergence in CUDA - exit from a thread in kernel

I'm wondering how I can exit from a thread whose thread index is too big. I see two possibilities:

int i = threadIdx.x;
if(i >= count)
    return;
// do logic

or

int i = threadIdx.x;
if(i < count) {
    // do logic
}

I know that both are correct, but which one affects performance more?
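For context, here is a minimal sketch of how such a guard typically looks in a complete kernel; the kernel name scale, the data array, and count are hypothetical, and the index also folds in the block offset, which the snippets above leave out:

// Hypothetical kernel: scales an array, guarding against out-of-range indices.
__global__ void scale(float *data, float factor, int count)
{
    // Global index: block offset plus thread index within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Variant 1 from the question: early return for out-of-range threads.
    if (i >= count)
        return;

    data[i] *= factor;  // do logic
}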

asked Feb 14 '13 by Tomasz Dzięcielewski

People also ask

What is thread divergence in CUDA?

Warp divergence occurs when two threads of the same warp diverge in their execution due to a branch instruction, where one thread branches and the other does not. This leads to serialization of the two threads by the CUDA hardware until their execution path converges again.
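A minimal, hypothetical illustration of such a branch: the even/odd split below forces threads of the same warp onto different paths, so the hardware runs the two paths one after the other before the warp reconverges.

// Hypothetical kernel: threads of the same warp take different branches.
__global__ void divergent(int *out)
{
    int i = threadIdx.x;

    if (i % 2 == 0)
        out[i] = i * 2;   // even lanes take this path...
    else
        out[i] = i + 1;   // ...while odd lanes sit idle, then run this one
}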

What technique does the GPU use if the execution of threads within a warp diverges?

NVIDIA GPUs execute warps of 32 parallel threads using SIMT, which enables each thread to access its own registers, to load and store from divergent addresses, and to follow divergent control flow paths.

How are threads numbered in CUDA?

Each CUDA device has a maximum number of threads per block (512 or 1024, depending on compute capability). Each thread also has a thread id: threadId = x + y*Dx + z*Dx*Dy, where Dx and Dy are the block dimensions. The threadId is like the 1D representation of an array in memory. If you are working with 1D vectors, then y and z are zero, so threadId is simply x, i.e. threadIdx.x.
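Expressed in code (a hypothetical kernel; Dx and Dy above correspond to blockDim.x and blockDim.y):

// Sketch: flattening a 3D thread index into a single id within a block.
__global__ void flat_id_demo(int *out)
{
    // threadId = x + y*Dx + z*Dx*Dy, with Dx = blockDim.x and Dy = blockDim.y
    int threadId = threadIdx.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;

    out[threadId] = threadId;  // for a 1D block this reduces to threadIdx.x
}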


1 Answer

Although both are the same in terms of performance, you should take into account that the first one is not recommended.

Returning from a thread within a kernel could cause unexpected behaviour in the rest of your code.

By unexpected behaviour I mean any problem related to the warp, the minimum unit in which threads are grouped and scheduled. For example, if you have an if / else block in your kernel, the situation is known as thread divergence, and in the normal case it results in some threads remaining idle while others execute instructions.

CUDA by Example Book, Chapter 5, Thread Cooperation:

But in the case of __syncthreads(), the result is somewhat tragic. The CUDA Architecture guarantees that no thread will advance to an instruction beyond the __syncthreads() until every thread in the block has executed the __syncthreads().

So, it is mainly related to the threads synchronization within a kernel. You can find a very good question / answer about this topic here: Can I use __syncthreads() after having dropped threads?
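As an illustration of why this matters (a hypothetical reduction-style kernel, not taken from the linked answer): if out-of-range threads return early they never reach __syncthreads(), and the rest of the block can end up waiting at a barrier that is never satisfied. Guarding only the work, as below, keeps every thread at the barrier.

// Sketch: keep all threads alive so every one of them reaches __syncthreads().
// 'count' and the shared buffer size are hypothetical.
__global__ void sum_guarded(const float *in, float *out, int count)
{
    __shared__ float buf[256];           // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard the work, not the thread: out-of-range threads just write 0.
    buf[threadIdx.x] = (i < count) ? in[i] : 0.0f;

    __syncthreads();                     // reached by every thread in the block

    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int j = 0; j < blockDim.x; ++j)
            s += buf[j];
        out[blockIdx.x] = s;
    }
}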

As a final note, I've also used that bad practice and no problem appeared, but there is no guarantee that problems will not arise in the future. It is something that I would not recommend.

answered Nov 08 '22 by pQB