I'm going to parallelize a local search algorithm for an optimization problem on CUDA. The problem is very hard, so the size of the practically solvable instances is quite small. My concern is that the number of threads planned to run in one kernel is insufficient to obtain any speedup on the GPU (even assuming all memory accesses are coalesced, free of bank conflicts, non-branching, etc.). Let's say a kernel is launched with 100 threads. Is it reasonable to expect any benefit from using the GPU? What if the number of threads is 1000? What additional information is needed to analyze this case?
A GPU typically keeps several threads (roughly 4 to 10) resident per core in order to hide latency. The GPU follows a data-parallel model and applies the same operation to multiple data items (single instruction, multiple data, SIMD). GPU cards are primarily designed for this kind of fine-grained, data-parallel computation: the algorithm is applied across the input data in parallel.
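To make the data-parallel model concrete, here is a minimal, illustrative kernel (the names are hypothetical, not from the question): each thread applies the same operation to a different data item.

```
// Minimal sketch of data parallelism: every thread performs the same
// operation (an add) on a different element of the input arrays.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard against the last, partially filled block
        c[i] = a[i] + b[i];
}
```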
Other factors matter as well, such as the number of active blocks per Streaming Multiprocessor (SM). However, according to the CUDA manuals, it is better to use 128 or 256 threads per block if you do not want to worry about the deeper details of GPGPU programming.
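A small sketch of what that rule of thumb looks like in a launch configuration; `searchKernel` and `n` are placeholders for the local-search code and the number of candidate solutions, not something from the question:

```
#include <cuda_runtime.h>

// Placeholder for the actual local-search kernel; each thread would
// evaluate or improve one candidate solution.
__global__ void searchKernel(int n) { /* ... */ }

int main()
{
    int n = 1000;                    // number of parallel work items (assumed)
    int threadsPerBlock = 256;       // rule-of-thumb block size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    searchKernel<<<blocks, threadsPerBlock>>>(n);
    cudaDeviceSynchronize();
    return 0;
}
```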
There are 32 threads per warp. That is a constant across all CUDA cards as of now.
A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction. Warps are selected for execution by the SM's scheduler. Once a thread block is launched on a multiprocessor (SM), all of its warps remain resident until their execution finishes.
100 threads is not really enough. Ideally you want a problem size that can be divided into at least as many thread blocks as there are multiprocessors (SMs) on the GPU, otherwise you will be leaving processors idle. Each thread block should have no fewer than 32 threads, for the same reason. Ideally, you should have a small multiple of 32 threads per block (say 96-512 threads), and if possible, multiple such blocks per SM.
At a minimum, you should try to have enough threads to cover the arithmetic latency of the SMs, which means that on a Compute Capability 2.0 GPU, you need about 10-16 warps (groups of 32 threads) per SM. They don't all need to come from the same thread block, though. So that means, for example, on a Tesla M2050 GPU with 14 SMs, you would need at least 4480 threads, divided into at least 14 blocks.
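As a back-of-the-envelope check, you can reproduce that kind of estimate for your own card by querying the device properties; the 12 warps per SM below is an assumed value inside the 10-16 range mentioned above:

```
#include <cstdio>
#include <cuda_runtime.h>

// Rough estimate of how many resident threads are needed to cover the
// SMs' arithmetic latency. The warps-per-SM figure varies by architecture;
// treat this as a sketch, not a precise requirement.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int warpsPerSmForLatency = 12;   // assumed, somewhere in the 10-16 range
    int minThreads = prop.multiProcessorCount * warpsPerSmForLatency * prop.warpSize;

    printf("%d SMs -> roughly %d resident threads needed to hide arithmetic latency\n",
           prop.multiProcessorCount, minThreads);
    return 0;
}
```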
That said, fewer threads than this could also provide a speedup -- it depends on many factors. If the computation is bandwidth bound, for example, and you can keep the data in device memory, then you could get a speedup because GPU device memory bandwidth is higher than CPU memory bandwidth. Or, if it is compute bound, and there is a lot of instruction-level parallelism (independent instructions from the same thread), then you won't need as many threads to hide latency. This latter point is described very well in Vladimir Volkov's "Better performance at lower occupancy" talk from GTC 2010.
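To illustrate the instruction-level-parallelism point, here is a hypothetical variant of an element-wise kernel in which each thread handles several independent elements, so fewer threads are needed to hide latency (the names and the unroll factor are assumptions, not taken from Volkov's talk):

```
#define ELEMS 4   // independent elements processed per thread (assumed)

// Sketch of instruction-level parallelism: the ELEMS operations in each
// thread are independent of one another, so the hardware can overlap
// their latencies even with a modest total thread count.
__global__ void vecAddIlp(const float* a, const float* b, float* c, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ELEMS;

    #pragma unroll
    for (int k = 0; k < ELEMS; ++k) {
        int i = base + k;
        if (i < n)
            c[i] = a[i] + b[i];
    }
}
```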
The main thing is to make sure you use all of the SMs: without doing so you aren't using all of the computation performance or bandwidth the GPU can provide.