 

CUDA cores vs thread count

I am confused by the relationship between the number of cores in an NVidia GPU, number of SMPs, and the max thread count. The device properties for my laptop's GT650m show 384 cores, 2 SMPs, with 1024 threads per SMP.

How are these numbers related to each other and to the warp size? I assume (perhaps incorrectly) that there are 192 cores per SMP, but that's not a factor of 1024. If each core runs a warp of 32 threads, I would expect 32 * 192 threads per SMP, or 2^5 * (2^7 + 2^6), or 4096 + 2048 = 6144.

What am I missing?

3Dave asked Jun 07 '13 14:06



1 Answer

I think you should take a deeper look at how kernels are scheduled in CUDA.

There are two important sizes: the number of blocks and the number of threads per block.

Each block is scheduled on one SM, where it is split into warps. This is why a block has shared memory that is accessible only from within that block: it resides in the SM's memory. The number of blocks per SM depends on the device limit and on the occupancy calculation. The maximum is 8 blocks per SM for CC 1.0-2.x and 16 for CC 3.x.

Each block has a certain number of threads. The threads are divided into warps, and the warps can run in an arbitrary order determined only by the warp scheduler on the SM.
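To make the split into warps concrete, here is a minimal Python sketch of the arithmetic. It assumes a warp size of 32, as in the question; the function name `warps_per_block` is made up for this illustration and is not part of any CUDA API.

```python
WARP_SIZE = 32  # assumed warp size, as stated in the question


def warps_per_block(threads_per_block, warp_size=WARP_SIZE):
    """Number of warps a block occupies (hypothetical helper).

    A partially filled last warp still counts as a whole warp.
    """
    return (threads_per_block + warp_size - 1) // warp_size


print(warps_per_block(1024))  # 32
print(warps_per_block(100))   # 4
```

So a block of 1024 threads is 32 warps, which is unrelated to the 192 cores on the SM.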

Your card has a total of 384 cores on 2 SMs, with 192 cores each. The CUDA core count represents the total number of single-precision floating-point or integer thread instructions that can be executed per cycle. Do not use the CUDA core count in any thread or block calculation.

The maximum number of threads per block varies with the compute capability. CC 2.0-3.x support a maximum of 1024 threads per block, given sufficient registers and warp slots. Warps are statically assigned to warp schedulers. The number of warp schedulers per SM is 1 for CC 1.x, 2 for CC 2.x, and 4 for CC 3.x.

If your application does not execute concurrent kernels, then to use every SM the grid should contain at least as many blocks as there are SMs.

To fully use the compute power of your GT650m you should launch at least two blocks (with a single block you could only use one SM). On the other hand, if you want to schedule 10240 threads, you could simply launch 10 blocks of 1024 threads each.
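The block arithmetic above can be sketched in Python. This is only an illustration: the helper name `blocks_needed` is hypothetical, and the constants (2 SMs, 1024 threads per block) are the figures quoted in this answer for the GT650m rather than values read from a device.

```python
def blocks_needed(total_threads, threads_per_block):
    """Ceiling division: blocks required to cover total_threads (hypothetical helper)."""
    return (total_threads + threads_per_block - 1) // threads_per_block


NUM_SMS = 2                   # assumed from the question (GT650m)
MAX_THREADS_PER_BLOCK = 1024  # assumed limit for CC 2.0-3.x, as stated above

blocks = blocks_needed(10240, MAX_THREADS_PER_BLOCK)
print(blocks)             # 10
print(blocks >= NUM_SMS)  # True: there are enough blocks for every SM to get one
```

In a real program the same ceiling division is typically used to pick the grid size for a kernel launch.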

Michael answered Oct 09 '22 17:10
