Relation between number of blocks of threads and cuda cores on machine (in CUDA C)

Tags:

cuda

I have CUDA 2.1 installed on my machine and it has a graphic card with 64 cuda cores. I have written a program in which I initialize simultaneously 30000 blocks (and 1 thread per block). But am not getting satisfying results from the gpu (It performs slowly than the cpu)

Is it that the number of blocks must be smaller than or equal to the number of cores for good performance? Or is it that the performance has nothing to do with number of blocks

429

asked Jun 13 '11 11:06

Pushpak Dagade

2 Answers

CUDA cores are not exactly what you might call a core on a classical CPU. Indeed, they have to be viewed as nothing more than ALUs (Arithmetic and Logic Units), which are just able to compute ready operations.

You might know that threads are handled per warps (groups of 32 threads) inside the blocks you've defined. When your blocks are dispatched on the different SMs (Streaming Multiprocessors, they are the actual cores of the GPU), each SM schedules warps within a block to optimize the computation time in regard of the memory access time needed to get threads' input data.

The problem is threads are always handled through their belonging warp, so if you have only one thread per block, the SM it is running on won't be able to schedule through warps and you won't take advantage of the multiple CUDA cores available. Your CUDA cores will be waiting for data to process, since CUDA cores compute far quicker than data are retrieved through memory.

Having lots of blocks with few threads is not what the GPU is awaiting. In this case, you face the block per SM limitation (this number depends on your device), which force your GPU to spend a lot of time to put blocks on SM and then remove them to treat the next ones. You should rather increase the number of threads in your blocks instead of the number of blocks in your application.

174

answered Jan 04 '23 11:01

jopasserat

The warp size in all current CUDA hardware is 32. Using less than 32 threads per block (or not using a round multiple of 32 threads per block) just wastes cycles. As it stands, using 1 thread per block is leaving something like 95% of the ALU cycles of your GPU idle. That is the underlying reason for the poor performance.

answered Jan 04 '23 11:01

talonmies

Related questions
                            
                                is there a better and a faster way to copy from CPU memory to GPU using thrust?
                            
                                CUDA coalesced access to global memory
                            
                                CUDA5 Examples: Has anyone translated some cutil definitions to CUDA5?
                            
                                Is CUDA pinned memory zero-copy?
                            
                                How to list CUDA devices in windows 7 using cmd?
                            
                                Use of unique_ptr and cudaMalloc
                            
                                Parameters to CUDA kernels
                            
                                About cudaMemcpyAsync Function
                            
                                What is the OpenCL analogue for CUDA's __syncthreads() and blockIdx.x?
                            
                                Implementing Max Reduce in Cuda
                            
                                CUDA and pinned (page locked) memory not page locked at all?
                            
                                Why cuSparse is much slower than cuBlas for sparse matrix multiplication
                            
                                Large matrix multiplication on gpu
                            
                                GPU Programming?
                            
                                CUDA: What is the threads per multiprocessor and threads per block distinction? [duplicate]
                            
                                Does cuDNN library works with All nvidia graphic cards?
                            
                                PyCUDA: Querying Device Status (Memory specifically)
                            
                                How is a CUDA kernel launched?
                            
                                Obtaining the CUDA include dir in C++ targets with native-CUDA-support CMake?
                            
                                How can I get syntax highlighting for a .cu file in Visual Studio?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With