Most of the time a branch is required in a CUDA or OpenCL program, like:
for (int i=0; i<width; i++)
{
if( i % threadIdx.x == 0)
quantity += i*i;
}
the code can always (or at least, most of the time) be rewritten in non-branching style:
for (int i=0; i<width; i++)
{
quantity += i*i* (i % threadIdx.x != 0);
}
the tradeoff seems to be either running in a single warp slot versus doing more computations on all the threads (in the second case, the sum is executed always, just that sometimes the value is zero)
Assuming that branching operations will take multiple warp slots for each possible branch, one would expect the second be consistently better than the first, Now my question is; Can i rely on the compiler to optimize 1) into 2) whenever it makes sense, or there is not a broadly-applicable criteria, which implies that it cannot be decided in general which one is better without trying and profiling?
Modulo operations are reasonably expensive: I am reasonably sure adding in the modulo would use more time than just having a single instruction that only 1 thread executes. Your single branching statement, an if
with no else
, will only hang the other threads while that if statment is being executed. Because gpus are optimized for very fast context switching, there should be very little cost for that.
You are advised against using long branching statements though: too much serial computation on the GPU (ie one thread doing all the work) negates the advantage of the parallelism.
In my experience - it is totally up to the compiler-writers to optimize these sorts of edge cases.
So can I think of any cases where 1) cannot be turned to 2) ? Here's one: I have written kernels where it was more efficient to run certain pieces of the computations every 10 threads or something like that in which case such an optimization cannot be inferred even though there is a mathematical operation (a divide an subtract) that can yield the same result irrespective of conditional versus "run on all but yield zero results".
However, even given that checking for threadId == 0 is a common enough scenario, I have no idea if it is actually optimized for. I would wager that it depends on the implementation AND even the device itself (CPU vs GPU).
You will have to try it to really find out what works best, not just because of the reason above but also because the work scheduler might behave differently based on how expensive it is to schedule/start/stop a set of threads as opposed to having them all run (and most provide a zero/identity result).
Hope this helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With