I'm working on an OpenGL implementation of the Oculus Rift distortion shader. The shader works by taking the input texture coordinate (of a texture containing a previously rendered scene), transforming it using distortion coefficients, and then using the transformed coordinate to determine the fragment color.
I'd hoped to improve performance by pre-computing the distortion and storing it in a second texture, but the result is actually slower than the direct computation.
The direct calculation version looks basically like this:
float distortionFactor(vec2 point) {
    float rSq = lengthSquared(point);
    // polynomial lens distortion: K[0] + K[1]*r^2 + K[2]*r^4 + K[3]*r^6
    float factor = (K[0] + K[1] * rSq + K[2] * rSq * rSq + K[3] * rSq * rSq * rSq);
    return factor;
}
void main()
{
    // distort the lens-space coordinate, then map it back into texture space
    vec2 distorted = vRiftTexCoord * distortionFactor(vRiftTexCoord);
    vec2 screenCentered = lensToScreen(distorted);
    vec2 texCoord = screenToTexture(screenCentered);
    vec2 clamped = clamp(texCoord, ZERO, ONE);
    if (!all(equal(texCoord, clamped))) {
        // outside the rendered scene: output a solid border color
        vFragColor = vec4(0.5, 0.0, 0.0, 1.0);
        return;
    }
    vFragColor = texture(Scene, texCoord);
}
where K is a vec4 that's passed in as a uniform.
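For completeness, the supporting declarations look roughly like this (a trimmed-down sketch; lensToScreen and screenToTexture are omitted since they just do the aspect ratio and lens offset corrections mentioned below):

uniform sampler2D Scene; // previously rendered scene for this eye
uniform vec4 K;          // distortion coefficients

in vec2 vRiftTexCoord;
out vec4 vFragColor;

const vec2 ZERO = vec2(0.0);
const vec2 ONE  = vec2(1.0);

// squared distance from the lens axis; avoids any sqrt
float lengthSquared(vec2 point) {
    return dot(point, point);
}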
On the other hand, the displacement map lookup looks like this:
void main() {
    vec2 texCoord = vTexCoord;
    if (Mirror) {
        texCoord.x = 1.0 - texCoord.x;
    }
    // dependent lookup: fetch the pre-computed distorted coordinate
    texCoord = texture(OffsetMap, texCoord).rg;
    vec2 clamped = clamp(texCoord, ZERO, ONE);
    if (!all(equal(texCoord, clamped))) {
        discard;
    }
    if (Mirror) {
        texCoord.x = 1.0 - texCoord.x;
    }
    FragColor = texture(Scene, texCoord);
}
There are a couple of other operations for correcting the aspect ratio and accounting for the lens offset, but they're pretty simple. Is it really reasonable to expect this to outperform a simple texture lookup?
GDDR memory has pretty high latency, and modern GPU architectures have plenty of number-crunching capability. It used to be the other way around: GPUs were so ill-equipped to do calculations that normalization was cheaper to do by fetching from a cube map.
Throw in the fact that you are not doing a regular texture lookup here, but rather a dependent lookup, and it comes as no surprise. Since the location you are fetching from depends on the result of another fetch, pre-fetching and efficient caching (effective latency-hiding strategies) of the memory needed by your shader become impossible. That is no "simple texture lookup."
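To make the distinction concrete, here is a rough sketch using your sampler names:

// regular lookup: the coordinate is known up front, so the hardware
// can issue the fetch early and hide the memory latency behind ALU work
vec4 direct = texture(Scene, vTexCoord);

// dependent lookup: the second fetch cannot even be issued until the
// first fetch returns, so its latency is much harder to hide
vec2 offset = texture(OffsetMap, vTexCoord).rg;
vec4 dependent = texture(Scene, offset);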
What is more, in addition to the dependent texture lookup, your second shader also includes the discard keyword. This will effectively eliminate the possibility of early depth testing on a lot of hardware.
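If you do keep the lookup-table version, one possible workaround (a sketch, assuming you do not rely on discard for blending) is to write a border color and return instead, as your first shader already does:

vec2 clamped = clamp(texCoord, ZERO, ONE);
if (!all(equal(texCoord, clamped))) {
    FragColor = vec4(0.0, 0.0, 0.0, 1.0); // opaque border color instead of discard
    return;
}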
Honestly, I do not see why you want to "optimize" the distortionFactor (...) function into a lookup. It uses the squared length, so you are not even dealing with a sqrt, just a bunch of multiplication and addition.
Andon M. Coleman already explained what's going on. Essentially, memory bandwidth and, more importantly, memory latency are the main bottlenecks of modern GPUs; hence, on everything built from about 2007 onward, simple calculations are often way faster than a texture lookup.
In fact, memory access patterns have such a large impact on efficiency that slightly rearranging the access pattern and ensuring proper alignment can easily give performance boosts of a factor of 1000 (BT;DT, though that was CUDA programming). A dependent lookup is not necessarily a performance killer, though: if the dependent texture coordinate is monotonic with the controlling texture, it's usually not so bad.
That being said, have you never heard of Horner's method? You can rewrite
float factor = (K[0] + K[1] * rSq + K[2] * rSq * rSq + K[3] * rSq * rSq * rSq);
trivially to
float factor = K[0] + rSq * (K[1] + rSq * (K[2] + rSq * K[3]) );
Saving you a couple of operations.
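Folded back into your distortionFactor (...) function, that would look something like this:

float distortionFactor(vec2 point) {
    float rSq = lengthSquared(point);
    // Horner form of K[0] + K[1]*rSq + K[2]*rSq^2 + K[3]*rSq^3
    return K[0] + rSq * (K[1] + rSq * (K[2] + rSq * K[3]));
}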