Why launch a multiple of 32 number of threads in CUDA?

I took a course in CUDA parallel programming and have seen many examples of CUDA thread configurations where the number of threads needed is rounded up to the nearest multiple of 32. I understand that threads are grouped into warps, and that if you launch 1000 threads the GPU will round up to 1024 anyway, so why do it explicitly?

asked Oct 28 '14 by Michael

People also ask

What can limit a program from launching the maximum number of threads on a GPU?

Hardware limits the number of blocks in a single launch to 65,535. Hardware also limits the number of threads per block with which a kernel can be launched.

How many threads are there in Nvidia CUDA warp?

NVIDIA GPUs execute warps of 32 parallel threads using SIMT, which enables each thread to access its own registers, to load and store from divergent addresses, and to follow divergent control flow paths.

How many threads can parallel CUDA have?

There are 32 threads per warp. That is constant across all CUDA cards as of now.

How many threads does a core CUDA have?

A CUDA Streaming Multiprocessor executes threads in warps of 32. There is a maximum of 1024 threads per block (for our GPU) and a maximum of 1536 threads per multiprocessor (for our GPU).
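These per-GPU limits differ between devices, which is why the answers above say "for our GPU". As a minimal sketch (assuming device 0 and the standard CUDA runtime API), you can query them at run time:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // query device 0
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }
        printf("Warp size:             %d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Max grid size (x):     %d\n", prop.maxGridSize[0]);
        return 0;
    }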


1 Answer

The advice is generally given in the context of situations where you might conceivably choose various threadblock sizes to solve the same problem.

Let's take vector add as an example. Suppose my vector is of length 100000. I might choose to do this by launching 100 blocks of 1000 threads each. In this case each block occupies 32 warps (1024 thread slots), so it has 1000 active threads and 24 inactive threads. My average utilization of thread resources is 1000/1024 = 97.6%.

Now, what if I chose blocks of size 1024? I only need to launch 98 blocks. The first 97 of these blocks are fully utilized in terms of thread utilization - every thread is doing something useful. The 98th block has only 672 (out of 1024) threads doing something useful; the other 352 are explicitly made inactive by a thread check (if (idx < N)) or other construct in the kernel code. But my overall average utilization is 100000/100352 = 99.6%.
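As a rough sketch of that second configuration (the kernel name vecAdd and the use of managed memory are my own choices, not from the answer): the grid size is rounded up with ceiling division, and the if (idx < n) check idles the 352 surplus threads in the last block.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)              // surplus threads in the last block do nothing
            c[idx] = a[idx] + b[idx];
    }

    int main() {
        const int N = 100000;
        const int threadsPerBlock = 1024;
        // Ceiling division: (100000 + 1023) / 1024 = 98 blocks
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

        float *a, *b, *c;
        cudaMallocManaged(&a, N * sizeof(float));
        cudaMallocManaged(&b, N * sizeof(float));
        cudaMallocManaged(&c, N * sizeof(float));
        for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, N);
        cudaDeviceSynchronize();

        printf("c[0] = %f, c[%d] = %f\n", c[0], N - 1, c[N - 1]);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Compiled with nvcc, this launches exactly the 98-block grid described above.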

So in the above scenario, it's better to choose "full" threadblocks, evenly divisible by 32.

If you are doing vector add on a vector of length 1000, and you intend to do it in a single threadblock (both may be bad ideas), then it does not matter whether you choose 1000 or 1024 for your threadblock size.
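To see why the two choices are equivalent in that case, note that a block's hardware footprint is its thread count rounded up to a whole number of warps; a tiny sketch of the arithmetic:

    #include <cstdio>

    // Threads are issued in warps of 32, so a block occupies
    // ceil(threads / 32) warps regardless of the exact thread count.
    int warpsPerBlock(int threads) { return (threads + 31) / 32; }

    int main(void) {
        printf("1000-thread block occupies %d warps\n", warpsPerBlock(1000)); // 32
        printf("1024-thread block occupies %d warps\n", warpsPerBlock(1024)); // 32
        return 0;
    }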

answered Sep 17 '22 by Robert Crovella