I understand that in CUDA, 32 adjacent threads in the same block are scheduled as a warp. But I frequently see tutorial CUDA code that uses multiple blocks with 1 thread per block. In this model, will 32 threads from 32 different blocks be scheduled as a warp? If not, can I say this model is not as efficient as organizing threads into blocks of 32? Thanks!
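For illustration, the launch pattern I mean looks roughly like this (the kernel name add_one is made up, and d_data / N are assumed to have been allocated and defined earlier):

    __global__ void add_one(float *data)      // hypothetical example kernel
    {
        // with one thread per block, blockIdx.x alone picks the element
        data[blockIdx.x] += 1.0f;
    }

    // tutorial-style launch: N blocks, 1 thread per block
    add_one<<<N, 1>>>(d_data);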
A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction. These threads are selected serially by the SM. Once a thread block is launched on a multiprocessor (SM), all of its warps are resident until their execution finishes.
Therefore, blocks are divided into warps of 32 threads for execution.
If threads in the same warp follow different paths of control flow, we say that these threads diverge in their execution. When a warp diverges, it executes each branch path in a separate pass; these passes are sequential to each other and thus increase the execution time.
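A minimal sketch of divergence within a warp (the kernel and the branch condition are purely illustrative, not from the original post): even and odd lanes take different branches, so the hardware runs the two paths in separate, serialized passes.

    __global__ void diverge_example(int *out)   // illustrative kernel
    {
        int tid = threadIdx.x;

        // threads in the same warp take different branches here,
        // so the two paths execute one after the other
        if (tid % 2 == 0)
            out[tid] = tid * 2;   // pass 1: even lanes active
        else
            out[tid] = tid * 3;   // pass 2: odd lanes active
    }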
Threads are fundamentally executed in warps of 32 threads. Blocks are composed of 1 or more warps, and a grid is composed of 1 or more blocks.
No, threads from different blocks cannot be scheduled in the same warp. If you create a grid of threadblocks with only a single thread each, you're definitely not getting the full performance from the machine; it's less efficient than having 32 (or an integer multiple of 32) threads per block. A Fermi SM, for example, has 32 warp lanes that can be in use. If you are scheduling blocks of a single thread, then only 1 of those 32 lanes is in use at any given time.
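A sketch of the difference (the kernel name scale and the sizes are made up for illustration): the same work launched as N blocks of 1 thread versus blocks that are a multiple of 32 threads. The kernel is identical; only the warp occupancy differs.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n)   // hypothetical kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // works for both launches below
        if (i < n)
            data[i] *= 2.0f;
    }

    int main()
    {
        const int N = 4096;
        float *d_data;
        cudaMalloc(&d_data, N * sizeof(float));

        // inefficient: every warp carries only 1 active lane out of 32
        scale<<<N, 1>>>(d_data, N);

        // better: 256 threads per block = 8 full warps per block
        // (both launches run here only to show the two configurations)
        scale<<<(N + 255) / 256, 256>>>(d_data, N);

        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }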
Threads have a thread ID (the threadIdx built-in variable), which is defined within (and unique only to) a single block.
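Because threadIdx repeats across blocks, a globally unique index is usually derived from both the block and thread indices. A common 1D pattern (the kernel below is just an illustration) is:

    __global__ void show_ids(int *local_ids, int *global_ids)   // illustrative kernel
    {
        int local_id  = threadIdx.x;                             // unique only within a block
        int global_id = blockIdx.x * blockDim.x + threadIdx.x;   // unique across the whole grid (1D case)

        local_ids[global_id]  = local_id;
        global_ids[global_id] = global_id;
    }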
The Hardware Multithreading section of the CUDA C Programming Guide gives a formula for the total number of warps in a single block.
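That formula is ceil(T / Wsize), where T is the number of threads per block and Wsize is the warp size (32 on current hardware). In integer arithmetic it is commonly computed as below (variable and function names are illustrative):

    // warps per block = ceil(T / Wsize)
    int warps_per_block(int threads_per_block, int warp_size)   // warp_size is 32 on current GPUs
    {
        return (threads_per_block + warp_size - 1) / warp_size;
    }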