I've recently been playing with compute shaders and I'm trying to determine the optimal way to set up my [numthreads(x,y,z)] and dispatch calls. My demo window is 800x600 and I am launching one thread per pixel. I am performing 2D texture modifications - nothing too heavy.
My first try was to specify
[numthreads(32,32,1)]
My Dispatch() calls are always
Dispatch(ceil(screenWidth/numThreads.x),ceil(screenHeight/numThreads.y),1)
So for the first instance that would be
Dispatch(25,19,1)
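For reference, the usual way to get that ceiling without floating-point math is the integer ceiling-division idiom. A minimal host-side sketch (the helper name is mine, not a D3D API):

```cpp
// Integer ceiling division: how many thread groups are needed to cover
// `pixels` pixels when each group handles `threadsPerGroup` of them.
// (Hypothetical helper, not part of Direct3D.)
unsigned groupCount(unsigned pixels, unsigned threadsPerGroup) {
    return (pixels + threadsPerGroup - 1) / threadsPerGroup;
}

// groupCount(800, 32) -> 25, groupCount(600, 32) -> 19,
// matching the Dispatch(25, 19, 1) call above.
```

Note that the extra 19th row of groups covers pixels beyond row 600, so the shader should still guard against out-of-range coordinates.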
This ran at 25-26 fps. I then reduced to [numthreads(4,4,1)], which ran at 16 fps. Increasing that to [numthreads(16,16,1)] started yielding nice results of about 30 fps. Toying with the Y thread group number, [numthreads(16,8,1)] managed to push it to 32 fps.
My question is: is there an optimal way to determine the thread group size so I can utilize the GPU most effectively, or is it just good ol' trial and error?
It's pretty GPU-specific but if you are on NVIDIA hardware you can try using the CUDA Occupancy Calculator.
I know you are using DirectCompute, but the two map to the same underlying hardware. If you look at the output of FXC you can see the shared memory size and registers per thread in the assembly. You can also deduce the compute capability from which card you have; compute capability is the CUDA equivalent of profiles like cs_4_0, cs_4_1, cs_5_0, etc.
The goal is to increase the "occupancy", or in other words occupancy == 100% - %idle-due-to-HW-overhead
Profiling is the only way to guarantee maximum performance on a particular piece of hardware. But as a general rule, as long as you keep your live register count low (16 or lower) and don't use a ton of shared memory, thread groups of exactly 256 threads should be able to saturate most compute hardware (assuming you're dispatching at least 8 or so groups).
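As a quick sanity check of those rules of thumb, you can validate a candidate configuration on the host before profiling. A small sketch (the function name and the exact thresholds of 256 threads and 8 groups are just the heuristics stated above, not hard hardware limits):

```cpp
// Heuristic check for a thread-group configuration, per the rules of
// thumb above: exactly 256 threads per group, and at least 8 groups
// dispatched in total. Real limits come from profiling your target GPU.
bool looksReasonable(unsigned tx, unsigned ty,
                     unsigned groupsX, unsigned groupsY) {
    unsigned threadsPerGroup = tx * ty;
    unsigned totalGroups = groupsX * groupsY;
    return threadsPerGroup == 256 && totalGroups >= 8;
}
```

By this measure, [numthreads(16,16,1)] with a Dispatch(50,38,1) over an 800x600 target passes (256 threads, 1900 groups), while [numthreads(4,4,1)] fails with only 16 threads per group, which lines up with the frame rates reported in the question.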