Demosaicing algorithm that contains downsampling

Introduction: What I am working on.

Hello everyone! I am working on a demosaicing algorithm that I use to transform images with a Bayer pattern into images representing the red, green and blue channels. I want the algorithm to have the following properties:

  1. It preserves as much raw information as possible.

  2. It does not obscure details in the image, even if that means absence of denoising.

  4. It produces as few artifacts as possible.

  4. If the mosaic image has size N x N, the three color images should each have size N/2 x N/2.

  5. The algorithm should be fast. To put "fast" into context: I will settle for something at least twice as fast as OpenCV's algorithm that uses bilinear interpolation.

What I have achieved so far.

So far, I've come up with an algorithm that uses bilinear interpolation and produces three images that are half the size of the mosaic image. The algorithm is approximately 3-4 times faster than OpenCV's cvtColor with the CV_BayerBG2BGR conversion (bilinear interpolation).

See the sketch of the Bayer pattern below to get an idea of how it works. I perform the interpolation at the points marked by circles. The numbers represent the coefficients by which I multiply the underlying pixels to get the interpolated value at the point marked by the black circle.

[figure: Bayer pattern sketch with interpolation points and coefficients]
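For reference, here is a minimal scalar sketch of the half-resolution idea. It assumes an RGGB cell layout and uses the simplest possible coefficients (take R and B directly and average the two greens of each 2x2 cell); the actual coefficients in the figure above differ, so treat this only as an illustration of the structure.

```cpp
#include <cstdint>
#include <vector>

// Half-resolution demosaic sketch for an assumed RGGB Bayer mosaic:
//   R G
//   G B
// Each 2x2 cell produces one output pixel per channel; the two greens
// are averaged with rounding. The real algorithm interpolates at cell
// centers with the coefficients from the figure, not reproduced here.
void demosaic_half(const uint8_t* bayer, int w, int h,
                   std::vector<uint8_t>& r,
                   std::vector<uint8_t>& g,
                   std::vector<uint8_t>& b)
{
    const int ow = w / 2, oh = h / 2;
    r.resize(ow * oh); g.resize(ow * oh); b.resize(ow * oh);
    for (int y = 0; y < oh; ++y) {
        const uint8_t* top = bayer + (2 * y) * w;  // even row: R G R G ...
        const uint8_t* bot = top + w;              // odd row:  G B G B ...
        for (int x = 0; x < ow; ++x) {
            r[y * ow + x] = top[2 * x];
            g[y * ow + x] = (uint8_t)((top[2 * x + 1] + bot[2 * x] + 1) >> 1);
            b[y * ow + x] = bot[2 * x + 1];
        }
    }
}
```

A single pass, one output pixel per 2x2 input cell, which is what makes this family of algorithms cache-friendly and easy to vectorize.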

You can observe the results of my algorithm below. I've also added the results of both demosaicing algorithms available in OpenCV (bilinear interpolation and variable number of gradients). Please note that while the results of my algorithm look really poor in comparison, OpenCV's bilinear interpolation results look almost exactly the same if I downsample them. This is of course expected, as the underlying algorithm is the same.

[figure: result comparison - my algorithm vs. OpenCV bilinear interpolation and variable number of gradients]

... so finally: the question.

My current solution gives acceptable results for my project and it is also acceptably fast. However, I would be willing to use an algorithm up to twice as slow if that brought improvements on any of the 5 criteria listed above. The question then is: how do I improve my algorithm without significantly hindering the performance?

I have enough programming experience for this task, so I am not specifically asking for code snippets - answers of any kind (code, links, suggestions - especially ones based on past experience) are welcome.

Some additional information:

  • I am working in C++.
  • The algorithm is highly optimized: it uses SSE instructions and is not parallelized.
  • I work with large images (a few MB in size); cache awareness and avoiding multiple passes through the image are very important.

I am not looking for general programming advice (such as optimization in general, etc.), but task-specific answers are more than welcome. Thank you in advance.

Nejc asked Mar 30 '17 22:03

2 Answers

High quality results are obtained by filtering the samples to their Nyquist frequency. Usually that's half the sample rate, but in this case, since your red and blue samples come at only half the pixel rate, their Nyquist frequency is 1/4 the pixel rate. When you resize you need to filter to the lower of the Nyquist rates of the input and output, but since your output is half the size of your input, you again need to filter to 1/4.

The perfect filter is the Sinc function; it delivers 100% of the signal below the cutoff frequency and none above it. Unfortunately it's completely impractical, extending as it does to infinity. For practical applications a windowed Sinc is used instead, the most common being the Lanczos kernel. The window size is chosen on the basis of quality vs. speed, with higher orders being closer to the ideal Sinc function. In your case, since speed is paramount, I suggest Lanczos2.

The cutoff frequency of the filter is inversely proportional to its width. Since in this case we want the cutoff to be half of the normal cutoff, the filter will be stretched to twice its normal width. Ordinarily a Lanczos2 filter will require inputs up to but not including +/-2 pixels from the center point; stretching it by 2 requires inputs up to +/-4 from the center.

The choice of a center point is completely arbitrary once you have a good cutoff filter. In your case you chose a point midway between 4 sample points. If we instead choose a point that is exactly on one of the input samples, some interesting things happen. Many of the filter coefficients become zero, which means those pixels don't have to be included in the calculations. In the example below I've centered on the red pixel, and we find that the red pixels need no filtering at all! Following is a diagram with the Lanczos2 filter values, scaled so that they total 1.0 for each color, followed by the formulas that result.

[figure: Lanczos2 filter values on a Bayer matrix]

red = p[x][y]
green = (p[x][y-3] + p[x-3][y] + p[x+3][y] + p[x][y+3]) * -0.03125 +
        (p[x][y-1] + p[x-1][y] + p[x+1][y] + p[x][y+1]) * 0.28125
blue = (p[x-3][y-3] + p[x+3][y-3] + p[x-3][y+3] + p[x+3][y+3]) * 0.00391 +
       (p[x-1][y-3] + p[x+1][y-3] + p[x-3][y-1] + p[x+3][y-1] + p[x-3][y+1] + p[x+3][y+1] + p[x-1][y+3] + p[x+1][y+3]) * -0.03516 +
       (p[x-1][y-1] + p[x+1][y-1] + p[x-1][y+1] + p[x+1][y+1]) * 0.31641

If you'd prefer to keep everything in the integer domain, this works very well with fixed-point numbers too.

red = p[x][y]
green = ((p[x][y-3] + p[x-3][y] + p[x+3][y] + p[x][y+3]) * -32 +
         (p[x][y-1] + p[x-1][y] + p[x+1][y] + p[x][y+1]) * 288) >> 10
blue = ((p[x-3][y-3] + p[x+3][y-3] + p[x-3][y+3] + p[x+3][y+3]) * 4 +
        (p[x-1][y-3] + p[x+1][y-3] + p[x-3][y-1] + p[x+3][y-1] + p[x-3][y+1] + p[x+3][y+1] + p[x-1][y+3] + p[x+1][y+3]) * -36 +
        (p[x-1][y-1] + p[x+1][y-1] + p[x-1][y+1] + p[x+1][y+1]) * 324) >> 10

The green and blue pixel values may end up outside of the range 0 to max, so you'll need to clamp them when the calculation is complete.
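Putting the fixed-point formulas and the clamping together, a sketch of a complete per-pixel function (names and the struct are mine; (x, y) is assumed to be a red sample at least 3 pixels from the image border):

```cpp
#include <algorithm>
#include <cstdint>

struct RGB { uint8_t r, g, b; };

static inline uint8_t clamp255(int v) {
    return (uint8_t)std::min(std::max(v, 0), 255);
}

// Fixed-point Lanczos2 demosaic at one red sample of the mosaic `p`
// (stride = row length in pixels). The >> 10 divides by 1024; an
// arithmetic right shift of a negative sum is assumed, which holds on
// mainstream compilers. Green and blue can over/undershoot, so clamp.
RGB demosaic_at(const uint8_t* p, int stride, int x, int y) {
    auto P = [&](int dx, int dy) -> int {
        return p[(y + dy) * stride + (x + dx)];
    };

    int red   = P(0, 0);
    int green = ((P(0, -3) + P(-3, 0) + P(3, 0) + P(0, 3)) * -32 +
                 (P(0, -1) + P(-1, 0) + P(1, 0) + P(0, 1)) * 288) >> 10;
    int blue  = ((P(-3, -3) + P(3, -3) + P(-3, 3) + P(3, 3)) * 4 +
                 (P(-1, -3) + P(1, -3) + P(-3, -1) + P(3, -1) +
                  P(-3, 1) + P(3, 1) + P(-1, 3) + P(1, 3)) * -36 +
                 (P(-1, -1) + P(1, -1) + P(-1, 1) + P(1, 1)) * 324) >> 10;

    return { (uint8_t)red, clamp255(green), clamp255(blue) };
}
```

On a flat image the weights sum to 1024/1024, so the function returns the input value unchanged, which is a handy sanity check.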

Mark Ransom answered Oct 20 '22 05:10


I'm a bit puzzled by your algorithm and won't comment on it... but to put some things into perspective...

OpenCV is a library that contains a lot of generic functionality to get the job done, and it is sometimes deliberately not optimized to the limit; there is a cost/maintainability tradeoff, and "good enough" is better than "better". There are plenty of vendors selling performance-optimized libraries implementing some of OpenCV's features, sometimes with the exact same API. I have not used it, but OpenCV has cv::gpu::cvtColor(), which could achieve your goals out of the box, assuming it is implemented for demosaicing and that you have a suitable GPU.

Considering bilinear demosaicing, a less-maintainable but more-optimized CPU implementation can run much faster than the one in OpenCV; I'd estimate above 250 Mpx/s on one mainstream CPU core.

Now to elaborate on the optimization path...

First, because demosaicing is a local operation, cache awareness is not really a significant problem.

A performance-optimized implementation will have different code paths depending on the image dimensions, the Bayer pattern type, and the instruction sets supported by the CPU (and their speed/latency); for such a simple algorithm, it's going to become a lot of code.

There are SIMD instructions for shuffling, arithmetic (including averaging), and streaming memory writes that you'll find useful. Intel's intrinsics summary is not so bad to navigate, and Agner Fog's site is also valuable for any kind of implementation optimization. AVX and AVX2 provide several interesting instructions for pixel processing.
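As a concrete instance of the averaging instructions mentioned: _mm_avg_epu8 computes the rounded byte average (a + b + 1) >> 1 of 16 pixel pairs in one instruction, which is exactly what averaging the two greens of each Bayer cell needs. A sketch, assuming x86 with SSE2 (baseline on x86-64):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Rounded average of two byte rows, (a[i] + b[i] + 1) >> 1, 16 pixels
// per iteration without widening to 16 bits, plus a scalar tail.
void avg_rows_sse2(const uint8_t* a, const uint8_t* b, uint8_t* out, int n) {
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(out + i), _mm_avg_epu8(va, vb));
    }
    for (; i < n; ++i)
        out[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```

For final results that are not re-read immediately, _mm_stream_si128 instead of the store avoids polluting the cache on large images.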

If you are more the 80/20 kind of person (good for you!), you'll appreciate working with a tool like Halide, which can generate optimized stencil code like a breeze (modulo the learning curve, which sets you back a few days compared to a 1-hour naive implementation or 10 minutes using OpenCV) and especially handles the boundary conditions (image borders).

You can get a little further (or take an alternative road) by using compiler intrinsics to access specific CPU instructions. At this point your code is 4x costlier in development terms, and it will probably get you 99% as far as hand-crafted assembly (another 4x the cost).

If you want to squeeze out the last drop (not generally advisable), you will definitely have to spend days benchmarking implementations to see which sequence of instructions gives the best performance.

But also, GPUs... you can use your integrated GPU to perform demosaicing; it may be a little faster than the CPU and it has access to main memory... of course you'd have to take care of pre-allocating shared buffers. A discrete GPU would have more significant transfer overhead at these fill rates.

cJ Zougloub answered Oct 20 '22 06:10