Is there any way I could EXPLICITLY limit the number of GPU multiprocessors being used during the runtime of my program? I would like to measure how my algorithm scales with a growing number of multiprocessors.
If it helps: I am using CUDA 4.0 and a device with compute capability 2.0.
Aaahhh... I know the problem. I played with it myself when writing a paper.
There is no explicit way to do it; however, you can try "hacking" it by having some of the blocks do nothing.
From my own experiments, compute capability 1.3 devices (I had a GTX 285) schedule the blocks in sequence: if I launch 60 blocks onto 30 SMs, blocks 1-30 are scheduled onto SMs 1-30, and blocks 31-60 again onto SMs 1-30. So, by disabling blocks 5 and 35, SM number 5 ends up doing practically nothing (see the sketch below).
Note, however, that this is a private, experimental observation I made two years ago. It is in no way confirmed, supported, or maintained by NVIDIA, and it may change (or may already have changed) with newer GPUs and/or drivers.
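With that caveat in mind, here is a minimal sketch of the hack. The kernel name, the dummy workload, and the choice of disabled block indices are placeholders of mine, and the round-robin block-to-SM mapping is only the assumption described above, not anything guaranteed:

    // Sketch of the "do-nothing blocks" hack, assuming the (unconfirmed)
    // round-robin block-to-SM scheduling described above on a 30-SM device.
    #include <cuda_runtime.h>

    __global__ void dummyWork(float *out, int iters)
    {
        // Blocks 5 and 35 exit immediately; under the assumed schedule both
        // would land on the same SM, leaving that SM practically idle.
        if (blockIdx.x == 5 || blockIdx.x == 35)
            return;

        // A lot of "stupid work" so an idle SM becomes visible in the runtime.
        float x = (float)threadIdx.x;
        for (int i = 0; i < iters; ++i)
            x = x * 0.999f + 0.001f;
        out[blockIdx.x * blockDim.x + threadIdx.x] = x;
    }

    int main()
    {
        const int blocks = 60, threads = 256;   // 60 blocks onto 30 SMs
        float *d_out;
        cudaMalloc((void**)&d_out, blocks * threads * sizeof(float));

        dummyWork<<<blocks, threads>>>(d_out, 1 << 20);
        cudaDeviceSynchronize();

        cudaFree(d_out);
        return 0;
    }

Comparing the runtime of this launch against the same launch with the early exit removed (or with other block indices disabled) is exactly the kind of enabled/disabled experiment suggested below.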
I would suggest playing with some simple kernels that do a lot of stupid work and seeing how long they take to compute in various "enabled"/"disabled" configurations. If you are lucky, you will catch a performance drop, indicating that two blocks are in fact being executed by a single SM.
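A complementary check, not mentioned above but worth trying, is to read the %smid PTX special register from each block and dump which SM each block actually ran on. The register is documented in the PTX ISA; the mapping it reveals, however, comes with no guarantees, and all names below are placeholders:

    // Sketch: record which SM each block actually ran on by reading the %smid
    // PTX special register. NVIDIA makes no promises about the block-to-SM
    // assignment this reveals, so treat the output as a snapshot, not a rule.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void recordSM(unsigned int *smOfBlock)
    {
        if (threadIdx.x == 0) {
            unsigned int smid;
            asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
            smOfBlock[blockIdx.x] = smid;     // one entry per block
        }
    }

    int main()
    {
        const int blocks = 60, threads = 128;
        unsigned int *d_sm, h_sm[60];
        cudaMalloc((void**)&d_sm, blocks * sizeof(unsigned int));

        recordSM<<<blocks, threads>>>(d_sm);
        cudaMemcpy(h_sm, d_sm, blocks * sizeof(unsigned int),
                   cudaMemcpyDeviceToHost);

        for (int b = 0; b < blocks; ++b)
            printf("block %2d ran on SM %u\n", b, h_sm[b]);

        cudaFree(d_sm);
        return 0;
    }

If two of your "disabled" block indices report the same SM and that SM never shows up for any working block, the hack is behaving the way you hoped.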