CUDA: What is the threads per multiprocessor and threads per block distinction? [duplicate]

We have a workstation with two Nvidia Quadro FX 5800 cards installed. Running the deviceQuery CUDA sample reveals that the maximum threads per multiprocessor (SM) is 1024, while the maximum threads per block is 512.

Given that only one block can be executed on each SM at a time, why is the maximum threads per multiprocessor double the maximum threads per block? How do we utilise the other 512 threads per SM?

Device 1: "Quadro FX 5800"
  CUDA Driver Version / Runtime Version          5.0 / 5.0
  CUDA Capability Major/Minor version number:    1.3
  Total amount of global memory:                 4096 MBytes (4294770688 bytes)
  (30) Multiprocessors x (  8) CUDA Cores/MP:    240 CUDA Cores
  GPU Clock rate:                                1296 MHz (1.30 GHz)
  Memory Clock rate:                             800 Mhz
  Memory Bus Width:                              512-bit
  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Cheers, James.

asked Jul 23 '13 by James Paul Turner


1 Answer

Given that only one block can be executed on each SM at a time,

This statement is fundamentally incorrect. Barring resource conflicts, and assuming enough threadblocks in a kernel (i.e. the grid), an SM will generally have multiple threadblocks assigned to it.

The basic unit of execution is the warp. A warp consists of 32 threads, executed together in lockstep by an SM, on an instruction-cycle by instruction-cycle basis.

Therefore, even within a single threadblock, an SM will generally have more than a single warp "in flight". This is essential for good performance, as it allows the machine to hide latency.

There is no conceptual difference between choosing warps from the same threadblock to execute, or warps from different threadblocks. SMs can have multiple threadblocks resident on them (i.e. with resources such as registers and shared memory assigned to each resident threadblock), and the warp scheduler will choose from amongst all the warps in all the resident threadblocks, to select the next warp for execution on any given instruction cycle.

Therefore, the SM has a greater number of threads that can be "resident" because it can support more than a single block, even if that block is maximally configured with threads (512, in this case). We utilize more than the threadblock limit by having multiple threadblocks resident.

You may also want to research the idea of occupancy in GPU programs.

answered Oct 04 '22 by Robert Crovella