
Maximum number of threads on a GPU

Tags: cuda, gpu, tesla

I am using a Tesla T10, which exposes 2 CUDA devices. Each device reports a maximum of 512 threads per block, maximum block dimensions of (512, 512, 64), a maximum grid size of (65535, 65535, 1), and 30 multiprocessors.

Now I want to know how many threads I can run in parallel. I have read previous answers, but none of them cleared up my doubt. From what I read, I can run 30 * 512 threads in parallel (maxNoOfMultiprocessor * maxThreadBlockSize).

But when I launched 32 blocks of 512 threads, it still worked. How is that possible? I also do not understand the maximum threads in each dimension or the maximum grid size. Please explain with an example. Thanks in advance.

user2182259 asked Jan 11 '23 21:01



2 Answers

For the purposes of this discussion, forget about how many multiprocessors there are. That number has nothing to do with how many blocks you can launch in a kernel (i.e., the grid).

The number of threads you can run in parallel (i.e., that can execute simultaneously) is different from the number of threads you can launch, or the number of blocks you can launch.

Normally, you do not want to launch grids that have only as many threads as the machine can run at a given time (maxNoOfMultiprocessor * maxThreadBlockSize). The machine wants many more threads than that so it can hide latency: while some threads wait on memory, the scheduler can switch to others that are ready to run.

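Your machine is limited to 512 threads per block, but you can launch a single-dimensional grid of up to 65535 blocks, or a two-dimensional grid of up to 65535 x 65535 blocks. The per-dimension block limits (512, 512, 64) cap each dimension separately, and the product of the three must still not exceed 512: a block of (16, 16, 2) is valid, while (512, 512, 64) is not. None of this means that all those blocks/threads are running simultaneously, but the machine will process them all eventually.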
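As a minimal sketch of these limits, the following CUDA program prints what the runtime reports for device 0 and then launches the configuration from the question, 32 blocks of 512 threads. The kernel `mark` and the helper `blocks_for` are hypothetical names chosen for illustration; `cudaGetDeviceProperties` and the `cudaDeviceProp` fields are part of the CUDA runtime API. It assumes device 0 exists and that your toolkit still supports your GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores its global index in out[].
__global__ void mark(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

// Hypothetical helper: blocks needed to cover n elements (ceiling division).
static int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims: (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("multiprocessors: %d\n", prop.multiProcessorCount);

    const int threads = 512;                    // per-block limit from the question
    const int n = 32 * threads;                 // more work than 30 * 512
    const int blocks = blocks_for(n, threads);  // 32 blocks

    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    mark<<<blocks, threads>>>(d_out, n);
    cudaError_t err = cudaGetLastError();   // reports an invalid launch configuration
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // reports errors during execution
    printf("%d blocks x %d threads: %s\n", blocks, threads, cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

The grid is sized from the amount of work (`n`), not from the multiprocessor count. The launch is valid as long as the block and grid dimensions stay within the reported limits.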

Robert Crovella answered Mar 15 '23 23:03



You can create many more threads than the hardware is able to handle simultaneously. NVIDIA calls this 'automatic scalability'. If you have a card with 30 multiprocessors (SMs) and, for simplicity, each SM runs one block at a time, 30 blocks will run in parallel and the remaining 2 blocks will run afterwards. If you then run the same program with 32 blocks on a card with only 16 SMs (let's suppose that exists), 16 blocks are run first, and then the other 16.
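The arithmetic in this answer can be sketched as follows. The helper `waves` and the empty kernel `noop` are hypothetical names for illustration, and "waves" is an idealization: the hardware hands out a new block whenever a resident one finishes rather than in lockstep rounds. The number of blocks an SM can hold at once depends on the kernel and the device, so the sketch also asks the runtime through `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which assumes a toolkit recent enough to provide that call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only so the runtime can report occupancy for it.
__global__ void noop() {}

// Hypothetical helper: idealized number of rounds needed to run `blocks` blocks
// when at most sms * blocks_per_sm blocks can be resident at the same time.
static int waves(int blocks, int sms, int blocks_per_sm)
{
    int resident = sms * blocks_per_sm;
    if (resident <= 0)
        return 0;  // nothing can run with this configuration
    return (blocks + resident - 1) / resident;
}

int main()
{
    // The two scenarios from the answer, assuming one resident block per SM.
    printf("32 blocks on 30 SMs: %d waves\n", waves(32, 30, 1));  // 30 blocks, then 2
    printf("32 blocks on 16 SMs: %d waves\n", waves(32, 16, 1));  // 16 blocks, then 16

    // What the installed device reports for 512-thread blocks of noop.
    cudaDeviceProp prop;
    int per_sm = 0;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess ||
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&per_sm, noop, 512, 0) != cudaSuccess) {
        fprintf(stderr, "could not query device 0\n");
        return 1;
    }
    printf("device 0: %d SMs x %d resident blocks -> %d wave(s) for 32 blocks\n",
           prop.multiProcessorCount, per_sm,
           waves(32, prop.multiProcessorCount, per_sm));
    return 0;
}
```

In both scenarios the same 32-block launch completes without any change to the program; only the number of rounds differs.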

damienfrancois answered Mar 16 '23 00:03
