What is the most efficient way to implement a convolution filter within a pixel shader?

Tags: optimization, opengl, glsl, shader

Implementing convolution in a pixel shader is somewhat costly due to the very high number of texture fetches.

A direct way of implementing a convolution filter is to make N x N lookups per fragment using two nested for loops. A simple calculation shows that a 1024x1024 image blurred with a 4x4 Gaussian kernel would need 1024 x 1024 x 4 x 4 = 16,777,216 (about 16M) lookups.
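For reference, the direct approach can be sketched on the CPU as follows. This is only an illustration in Python standing in for shader code: the function name `convolve_direct`, the clamp-to-edge addressing, and the 4x4 box kernel are assumptions made for the example, not part of any real API.

```python
def convolve_direct(image, kernel):
    """Naive 2D filter: one lookup per kernel tap per pixel.

    Assumes clamp-to-edge addressing (an illustrative choice).
    Returns the filtered image and the total number of lookups.
    """
    height, width = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    lookups = 0
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            # The two nested loops a fragment shader would run per fragment.
            for j in range(k_h):
                for i in range(k_w):
                    sy = min(max(y + j - k_h // 2, 0), height - 1)
                    sx = min(max(x + i - k_w // 2, 0), width - 1)
                    acc += image[sy][sx] * kernel[j][i]
                    lookups += 1
            output[y][x] = acc
    return output, lookups


# Small 8x8 test image and a 4x4 box kernel (hypothetical example data).
image = [[float(x + y) for x in range(8)] for y in range(8)]
box = [[1.0 / 16.0] * 4 for _ in range(4)]
blurred, lookups = convolve_direct(image, box)

print(lookups)               # 8 * 8 * 4 * 4 = 1024
print(1024 * 1024 * 4 * 4)   # 16777216 lookups for the 1024x1024 case
```

The lookup count grows with the image area times the kernel area, which is the cost the question is about.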

What can one do about this?

Can one use some optimization that would need fewer lookups? I am not interested in kernel-specific optimizations like the ones for the Gaussian (or are they kernel-specific?)
Can one at least make these lookups faster by somehow exploiting the locality of the pixels one would work with?

Thanks!

asked Mar 09 '11 09:03

Albus Dumbledore

1 Answer

Gaussian kernels are separable, which means you can do a horizontal pass first, then a vertical pass (or the other way around). That turns O(N^2) lookups per pixel into O(2N), i.e. 2N lookups instead of N^2. That works for all separable filters, not just for blur (not all filters are separable, but many are, and some are "as good as" separable).
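A minimal sketch of the two-pass idea, again in Python as a stand-in for two shader passes. It assumes a 5-tap binomial kernel (a common Gaussian approximation) and clamp-to-edge addressing; the helper names `convolve_2d` and `convolve_1d` are made up for this example.

```python
def clamp(index, size):
    return min(max(index, 0), size - 1)


def convolve_2d(image, kernel_2d):
    """Reference: full N x N filter, N*N lookups per pixel."""
    height, width = len(image), len(image[0])
    radius = len(kernel_2d) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for j, row in enumerate(kernel_2d):
                for i, weight in enumerate(row):
                    acc += weight * image[clamp(y + j - radius, height)][clamp(x + i - radius, width)]
            out[y][x] = acc
    return out


def convolve_1d(image, taps, horizontal):
    """One pass of a separable filter, N lookups per pixel."""
    height, width = len(image), len(image[0])
    radius = len(taps) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(taps):
                if horizontal:
                    acc += weight * image[y][clamp(x + k - radius, width)]
                else:
                    acc += weight * image[clamp(y + k - radius, height)][x]
            out[y][x] = acc
    return out


# 5-tap binomial weights; the 2D kernel is their outer product.
taps = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]
kernel_2d = [[a * b for b in taps] for a in taps]
image = [[float((x * 7 + y * 13) % 10) for x in range(12)] for y in range(9)]

full = convolve_2d(image, kernel_2d)
two_pass = convolve_1d(convolve_1d(image, taps, horizontal=True), taps, horizontal=False)

max_error = max(abs(a - b) for ra, rb in zip(full, two_pass) for a, b in zip(ra, rb))
print(max_error < 1e-9)                  # both approaches agree up to rounding
print(len(taps) ** 2, 2 * len(taps))     # lookups per pixel: 25 versus 10
```

In a real renderer the horizontal pass would be written to an intermediate render target that the vertical pass then reads, so the saving in lookups is paid for with one extra pass.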

Or, in the particular case of a blur filter (Gaussian or not), which is essentially a weighted sum, you can take advantage of texture interpolation, which may be faster for small kernel sizes (but definitely not for large kernel sizes).

EDIT: image for the "linear interpolation" method

[Image: the "linear interpolation" method]

EDIT (as requested by Jerry Coffin) to summarize the comments:

In the "texture filter" method, linear interpolation will produce a weighted sum of adjacent texels, weighted by how close the sample location is to each texel center. This is done by the texturing hardware, for free. That way, 16 texels can be summed in 4 fetches. Texture filtering can be exploited in addition to separating the kernel.

In the example image, on the top left, your sample (the circle) hits the center of a texel. What you get is the same as "nearest" filtering: you get that texel's value. On the top right, you are in the middle between two texels, so what you get is the 50/50 average of them (pictured by the lighter shade of blue). On the bottom right, you sample in between 4 texels, but somewhat closer to the top left one. That gives you a weighted average of all 4, but with the weight biased towards the top left one (darkest shade of blue).
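The trick described above can be sketched in one dimension. The Python below emulates a linearly filtered fetch and folds each pair of adjacent taps into a single fetch placed between the two texels. All names (`fetch_linear`, `merge_tap_pairs`) are hypothetical, texel centers are assumed to sit at integer positions (in GLSL they sit at half-texel offsets in normalized coordinates), and the merge assumes an even number of taps whose paired weights have the same sign and a non-zero sum.

```python
import math


def fetch_nearest(row, index):
    """Single texel read with clamp-to-edge addressing."""
    return row[min(max(index, 0), len(row) - 1)]


def fetch_linear(row, position):
    """Emulates one linearly filtered fetch at a fractional position."""
    left = math.floor(position)
    fraction = position - left
    return (1.0 - fraction) * fetch_nearest(row, left) + fraction * fetch_nearest(row, left + 1)


def merge_tap_pairs(offsets, weights):
    """Folds taps (k, k+1) into one (offset, weight) pair per filtered fetch."""
    merged = []
    for k in range(0, len(weights), 2):
        w0, w1 = weights[k], weights[k + 1]
        total = w0 + w1
        # Shift the sample towards the second texel in proportion to its weight.
        merged.append((offsets[k] + w1 / total, total))
    return merged


offsets = [-2, -1, 0, 1]
weights = [1 / 8, 3 / 8, 3 / 8, 1 / 8]
merged = merge_tap_pairs(offsets, weights)

row = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
four_fetches = [sum(w * fetch_nearest(row, x + o) for o, w in zip(offsets, weights))
                for x in range(len(row))]
two_fetches = [sum(w * fetch_linear(row, x + o) for o, w in merged)
               for x in range(len(row))]

max_error = max(abs(a - b) for a, b in zip(four_fetches, two_fetches))
print(merged)              # [(-1.25, 0.5), (0.25, 0.5)]
print(max_error < 1e-12)   # two filtered fetches reproduce four plain ones
```

Applying the same folding along both axes is how a 4x4 footprint of 16 texels comes down to 4 bilinear fetches. Real texture units interpolate with limited precision, so results on hardware may differ slightly from this exact arithmetic.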

The following suggestions are courtesy of datenwolf (see below):

"Another method I'd like to suggest is operating in Fourier space, where convolution turns into a simple product of the Fourier-transformed signal and the Fourier-transformed kernel. The Fourier transform itself is quite tedious to implement on the GPU, at least using OpenGL shaders, but it is quite easy to do in OpenCL. Actually, I now implement such things using OpenCL; a lot of the image processing in my 3D engine happens in OpenCL.

OpenCL has been specifically designed for running on GPUs. A Fast Fourier Transform is actually the piece of example code in Wikipedia's OpenCL article: en.wikipedia.org/wiki/OpenCL and yes, the performance gain is tremendous. An FFT executes in at most O(n log n), and the inverse transform costs the same. The Fourier representation of the filter kernel can be precomputed. The pipeline is FFT -> multiply with kernel -> IFFT, which boils down to O(n + 2n log n) operations. Note that the actual convolution is just O(n) there.

In the case of a separable, finite convolution like a Gaussian blur, the separation solution will outperform the Fourier method. But in the case of general, possibly non-separable kernels, the Fourier method is probably the fastest method available. OpenCL integrates nicely with OpenGL, e.g. you can use OpenGL buffers (textures and vertex) for both input and output of OpenCL programs."
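To illustrate the FFT -> multiply -> IFFT pipeline from the quote, here is a small one-dimensional sketch in plain Python (not OpenCL). It uses a textbook recursive radix-2 FFT, so the length must be a power of two, and it computes a circular convolution, meaning the signal wraps around at the ends. An image filter would need a 2D transform and padding to avoid that wrap-around. The function names are assumptions for this example only.

```python
import cmath
import math


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    The inverse transform is left unnormalized (caller divides by n).
    """
    n = len(values)
    if n == 1:
        return list(values)
    sign = 1 if inverse else -1
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circular_convolve_fft(signal, kernel):
    n = len(signal)
    signal_spectrum = fft(signal)
    kernel_spectrum = fft(kernel)  # could be precomputed once per kernel
    # The convolution itself is just this O(n) pointwise product.
    product = [a * b for a, b in zip(signal_spectrum, kernel_spectrum)]
    return [value.real / n for value in fft(product, inverse=True)]


def circular_convolve_direct(signal, kernel):
    """O(n^2) reference used only to check the FFT result."""
    n = len(signal)
    return [sum(signal[(i - j) % n] * kernel[j] for j in range(n)) for i in range(n)]


signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
kernel = [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]  # zero-padded to the signal length

via_fft = circular_convolve_fft(signal, kernel)
via_sum = circular_convolve_direct(signal, kernel)
max_error = max(abs(a - b) for a, b in zip(via_fft, via_sum))
print(max_error < 1e-9)   # both methods agree up to rounding
```

Note that the cost of the FFT route does not depend on how many kernel taps are non-zero, which is why it tends to pay off only for large or non-separable kernels.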

answered Oct 22 '22 17:10

Damon
