Divergence in CUDA - exit from a thread in kernel

Tags: performance, cuda, gpgpu, nvidia

I'm wondering how I can exit from a thread whose thread index is too big. I see two possibilities:

int i = threadIdx.x;
if(i >= count)
    return;
// do logic

int i = threadIdx.x;
if(i < count) {
    // do logic
}

I know that both are correct, but which one affects performance more?

asked Feb 14 '13 07:02

Tomasz Dzięcielewski

1 Answer

Although both are the same in terms of performance, you should take into account that the first one is not recommended.

Returning early from a thread within a kernel could cause unexpected behaviour in the rest of your code.

By unexpected behaviour I mean any problem related to the warp, the minimum unit in which threads are grouped. For example, if threads of the same warp take different paths through an if / else block in your kernel, the situation is known as thread divergence, and it normally results in some threads remaining idle while the others execute their instructions.
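To make the comparison concrete, here is a minimal sketch of the two guards from the question as complete kernels, with no barrier after the guard. The kernel names, the block size `kThreads`, and the host helpers `run_variant` and `variants_agree` are hypothetical and exist only for this illustration. Under these assumptions (one block, `count` no larger than the block size) both variants should leave the same data in memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // assumed block size for this sketch

// Variant 1: threads whose index is too big return immediately.
__global__ void double_early_return(float *data, int count)
{
    int i = threadIdx.x;
    if (i >= count)
        return;
    data[i] *= 2.0f;
}

// Variant 2: the work is wrapped in the guard, and every thread
// reaches the end of the kernel.
__global__ void double_guarded(float *data, int count)
{
    int i = threadIdx.x;
    if (i < count) {
        data[i] *= 2.0f;
    }
}

// Hypothetical helper: runs one variant in place on `values`.
static bool run_variant(bool early_return, std::vector<float> &values)
{
    int count = static_cast<int>(values.size());
    if (count <= 0 || count > kThreads)
        return false;
    size_t bytes = values.size() * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess)
        return false;
    bool ok = cudaMemcpy(dev, values.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        if (early_return)
            double_early_return<<<1, kThreads>>>(dev, count);
        else
            double_guarded<<<1, kThreads>>>(dev, count);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaMemcpy(values.data(), dev, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}

// Hypothetical helper: true if both variants produce identical output.
bool variants_agree(int count)
{
    if (count <= 0 || count > kThreads)
        return false;
    std::vector<float> a(count), b(count);
    for (int i = 0; i < count; ++i)
        a[i] = b[i] = static_cast<float>(i);
    if (!run_variant(true, a) || !run_variant(false, b))
        return false;
    return a == b && a[count - 1] == 2.0f * (count - 1);
}
```

Calling `variants_agree(100)` from a host `main()` launches 128 threads for 100 elements, so the last 28 threads take the out-of-range path in both kernels.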

CUDA by Example Book, Chapter 5, Thread Cooperation:

But in the case of __syncthreads(), the result is somewhat tragic. The CUDA Architecture guarantees that no thread will advance to an instruction beyond the __syncthreads() until every thread in the block has executed the __syncthreads()

So the problem is mainly related to thread synchronization within a kernel. There is a very good question and answer about this topic: Can I use __syncthreads() after having dropped threads?
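The synchronization concern can be sketched with a small block-level sum. This is an illustration under stated assumptions, not code from the book or the linked question: the kernel `block_sum`, the block size `kThreads` (a power of two), and the host helper `sum_on_device` are hypothetical names. The point is that out-of-range threads do not return; they contribute a zero instead, so every thread in the block reaches every `__syncthreads()`.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // assumed block size, must be a power of two

__global__ void block_sum(const float *in, float *out, int count)
{
    __shared__ float partial[kThreads];
    int i = threadIdx.x;

    // Guard the load instead of returning: threads past the end
    // store a neutral value and stay alive for the barriers below.
    partial[i] = (i < count) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory. The barrier sits outside the
    // conditional, so all threads of the block execute it.
    for (int stride = kThreads / 2; stride > 0; stride /= 2) {
        if (i < stride)
            partial[i] += partial[i + stride];
        __syncthreads();
    }

    if (i == 0)
        *out = partial[0];
}

// Hypothetical helper: sums up to kThreads values on the device.
// Returns -1.0f if the input size is unsupported or a CUDA call fails.
float sum_on_device(const std::vector<float> &values)
{
    int count = static_cast<int>(values.size());
    if (count <= 0 || count > kThreads)
        return -1.0f;

    float *dev_in = nullptr;
    float *dev_out = nullptr;
    float result = -1.0f;
    size_t bytes = values.size() * sizeof(float);

    if (cudaMalloc(&dev_in, bytes) == cudaSuccess &&
        cudaMalloc(&dev_out, sizeof(float)) == cudaSuccess &&
        cudaMemcpy(dev_in, values.data(), bytes,
                   cudaMemcpyHostToDevice) == cudaSuccess) {
        block_sum<<<1, kThreads>>>(dev_in, dev_out, count);
        if (cudaMemcpy(&result, dev_out, sizeof(float),
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            result = -1.0f;
    }
    cudaFree(dev_in);   // freeing a null pointer is a no-op
    cudaFree(dev_out);
    return result;
}
```

If the load were replaced by `if (i >= count) return;`, the threads that exited would never arrive at the barriers that the remaining threads wait on, which is the situation the quoted passage warns about.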

As a final note, I've also used that practice and no problem appeared, but there is no guarantee that problems will not arise in the future. It is something that I would not recommend.

answered Nov 08 '22 21:11

pQB
