 

The magic number of OpenCL registers

Tags: nvidia, opencl

I wrote two different OpenCL kernels, profiled them with the NVIDIA profiler, and found that both use 63 registers per work-item.

I tried everything I could think of to lower this number (replacing int with ushort, declaring variables inside {} blocks to show the compiler when it can discard them), but the 63 never changes!

Then I found another question about someone else's kernel that uses, again, 63 registers.

Of course this could be pure coincidence, but maybe there is a reason behind it: a specific function being used, or a hardware limitation? Does anyone know?

Istopopoki asked Dec 10 '12 17:12


1 Answer

63 registers per thread (work-item) is the maximum on most recent hardware, from the GTX 480 up to the GTX 770. Only with a GTX 780 or Tesla K20 do you get 255 registers per thread.

So when your kernel is reported as using 63 registers, it most likely needs more than 63, and the excess is spilled to off-chip private memory (aka CUDA local memory). For example, if the NVIDIA profiler reports 128 bytes of local memory, you need to get rid of 32 spilled registers (128 bytes / 4 bytes per 32-bit register) before you can get below 63 hardware registers.
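The spill arithmetic above can be sketched in a few lines. This is only a rough illustration: it assumes every byte of reported local memory comes from spilled 32-bit registers (4 bytes each), which need not hold if the kernel also keeps private arrays in local memory. The name `spilled_registers` is made up for this example.

```python
REGISTER_BYTES = 4  # assumption: one spilled register is one 32-bit value


def spilled_registers(local_mem_bytes: int) -> int:
    """Estimate the number of spilled 32-bit registers, rounding up."""
    # Ceiling division, so a partially used 4-byte slot still counts as one.
    return -(-local_mem_bytes // REGISTER_BYTES)


# The profiler figure quoted in the answer: 128 bytes of local memory.
print(spilled_registers(128))
```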

BTW: "8,192 32-bit registers per multiprocessor" means 8,192 registers shared by all the workgroups that are resident on the multiprocessor. The number of resident workgroups is usually bounded by the size of your workgroups and the number of registers your kernel needs. For example, if your kernel uses 63 registers and you have a workgroup size of 16×16 = 256 work-items, each workgroup needs 63 × 256 = 16,128 registers. Assuming 64K (65,536) registers per multiprocessor, 4 workgroups can be resident on each multiprocessor, i.e. 1,024 work-items. On hardware that allows up to 2,048 resident threads per multiprocessor (as the Kepler-class GPUs with 64K registers do), that is an occupancy of 50%.
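The register-bound estimate can be written out as a small calculation. This is a simplified sketch, not a replacement for NVIDIA's occupancy calculator: it ignores register allocation granularity, local (shared) memory usage, and the hardware cap on resident workgroups. The function names are hypothetical, and the limit of 2,048 resident work-items per multiprocessor is an assumption passed in as a parameter; the resulting percentage depends on that assumed limit.

```python
def resident_workgroups(regs_per_item: int, workgroup_size: int,
                        regs_per_sm: int) -> int:
    """Workgroups that fit on one multiprocessor, limited by registers only."""
    regs_per_group = regs_per_item * workgroup_size
    return regs_per_sm // regs_per_group


def occupancy(regs_per_item: int, workgroup_size: int,
              regs_per_sm: int, max_items_per_sm: int) -> float:
    """Fraction of the multiprocessor's resident work-item slots in use."""
    groups = resident_workgroups(regs_per_item, workgroup_size, regs_per_sm)
    # Never count more work-items than the hardware can keep resident.
    items = min(groups * workgroup_size, max_items_per_sm)
    return items / max_items_per_sm


# Numbers from the answer: 63 registers, 16x16 workgroup, 64K registers.
groups = resident_workgroups(63, 16 * 16, 64 * 1024)
# Assumption: 2,048 resident work-items per multiprocessor.
occ = occupancy(63, 16 * 16, 64 * 1024, 2048)
print(groups, occ)
```

Lowering the register count per work-item raises the number of workgroups that fit, which is why spills and occupancy have to be weighed against each other.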

Hugo Maxwell answered Oct 17 '22 17:10