CUDA: How many concurrent threads in total?

Tags:

gpgpu

I have a GeForce GTX 580, and I want to make a statement about the total number of threads that can (ideally) actually be run in parallel, to compare with 2- or 4-core CPUs.

deviceQuery gives me the following possibly relevant information:

CUDA Capability Major/Minor version number:    2.0
(16) Multiprocessors x (32) CUDA Cores/MP:     512 CUDA Cores
Maximum number of threads per block:           1024

I think I heard that each CUDA core can run a warp in parallel, and that a warp is 32 threads. Would it be correct to say that the card can run 512*32 = 16384 threads in parallel then, or am I way off and the CUDA cores are somehow not really running in parallel?


asked Jun 27 '11 08:06

Eskil

2 Answers

The GTX 580 can have 16 * 48 concurrent warps (32 threads each) running at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads.
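The arithmetic above can be written out as a short Python sketch. The constant and function names are illustrative, and the values are simply the figures quoted in this answer for the GTX 580, not something queried from the device.

```python
# Figures quoted in the answer for a GTX 580 (compute capability 2.0).
NUM_SMS = 16            # multiprocessors reported by deviceQuery
MAX_WARPS_PER_SM = 48   # resident warps per SM, as stated in the answer
THREADS_PER_WARP = 32   # warp size


def max_resident_threads(num_sms, warps_per_sm, threads_per_warp):
    """Upper bound on the number of threads resident on the chip at once."""
    return num_sms * warps_per_sm * threads_per_warp


resident = max_resident_threads(NUM_SMS, MAX_WARPS_PER_SM, THREADS_PER_WARP)
print(resident)  # 24576
```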

Don't confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored on-chip simultaneously -- the number that can be resident. In CUDA terms we also call this maximum occupancy. The hardware switches between warps constantly to help cover or "hide" the (large) latency of memory accesses as well as the (small) latency of arithmetic pipelines.

While each SM can have 48 resident warps, it can only issue instructions from a small number (on average between 1 and 2 for GTX 580, but it depends on the program instruction mix) of warps at each clock cycle.
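As a rough illustration of the gap between resident and issuing threads, the sketch below multiplies out the "between 1 and 2 warps per clock" figure from the paragraph above. It is a back-of-the-envelope estimate under that stated assumption, not a model of the actual scheduler, and the function name is made up for this example.

```python
NUM_SMS = 16           # multiprocessors on a GTX 580
THREADS_PER_WARP = 32  # warp size


def threads_issuing_per_cycle(num_sms, warps_issued_per_sm):
    """Threads that have an instruction issued in one clock, chip-wide.

    warps_issued_per_sm is the assumed number of warps an SM issues from
    per clock (the answer quotes an average between 1 and 2).
    """
    return num_sms * warps_issued_per_sm * THREADS_PER_WARP


low = threads_issuing_per_cycle(NUM_SMS, 1)
high = threads_issuing_per_cycle(NUM_SMS, 2)
print(low, high)  # 512 1024
```

Either bound is far below the number of threads that can be resident, which is why residency alone says little about throughput.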

So you are probably better off comparing throughput, which is determined by the available execution units and by how well the hardware can multi-issue. On GTX 580 there are 512 FMA execution units, but also integer units, special function units, memory instruction units, etc., which can be dual-issued to (i.e. the hardware can issue independent instructions from 2 warps simultaneously) in various combinations.

Taking into account all of the above is too difficult, though, so most people compare on two metrics:

1. Peak GFLOP/s, which for GTX 580 is 512 FMA units * 2 flops per FMA * 1544e6 cycles/second = 1581.1 GFLOP/s (single precision).
2. Measured throughput on the application you are interested in.
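The peak figure in the first metric is a plain product, sketched here in Python. The function name is illustrative, and the inputs are the numbers quoted above for the GTX 580.

```python
def peak_gflops(fma_units, flops_per_fma, clock_hz):
    """Theoretical peak in GFLOP/s: units * flops per unit per cycle * cycles/s."""
    return fma_units * flops_per_fma * clock_hz / 1e9


# 512 FMA units, 2 flops per FMA (multiply + add), 1544 MHz shader clock.
gtx580_peak = peak_gflops(512, 2, 1544e6)
print(f"{gtx580_peak:.1f} GFLOP/s")  # 1581.1 GFLOP/s
```

The same function can be fed a CPU's figures (SIMD lanes per core times cores, flops per lane per cycle, clock rate) to get a comparable theoretical peak.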

The most important comparison is always measured wall-clock time on a real application.


answered Nov 08 '22 11:11

harrism

There are certain traps that you can fall into when making that comparison with 2- or 4-core CPUs:

1. The number of concurrent threads is not the same as the number of threads that actually run in parallel. Of course you can launch 24576 threads concurrently on a GTX 580, but the optimal value is in most cases lower.
2. A 2- or 4-core CPU can have arbitrarily many concurrent threads! As with a GPU, beyond some point adding more threads won't help, and may even slow things down.
3. A "CUDA core" is a single scalar processing unit, while a CPU core is usually a bigger thing, containing for example a 4-wide SIMD unit. To compare apples to apples, you should multiply the number of advertised CPU cores by 4 to match what NVIDIA calls a core.
4. Many CPUs support hyper-threading, which allows a single core to process 2 threads concurrently in a lightweight way. Because of that, the operating system may actually see twice as many "logical cores" as there are hardware cores.

To sum it up: for a fair comparison, your 4-core CPU can actually run 32 "scalar threads" concurrently because of SIMD and hyper-threading (4 cores * 4 SIMD lanes * 2 hardware threads per core = 32).
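That count can be reproduced with a small Python sketch. The function name is made up for illustration, and the 4-wide SIMD unit and 2 threads per core are the assumptions stated in this answer, not properties of any particular CPU.

```python
def scalar_thread_equivalents(cores, simd_width, threads_per_core):
    """Rough count of "scalar threads" a CPU runs at once, in CUDA-core terms."""
    return cores * simd_width * threads_per_core


# Assumed: 4 cores, a 4-wide SIMD unit per core, 2 hyper-threads per core.
cpu_equivalent = scalar_thread_equivalents(cores=4, simd_width=4, threads_per_core=2)
print(cpu_equivalent)  # 32
```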

answered Nov 08 '22 10:11

CygnusX1
