
Variable block size sum of absolute difference calculation in C++

I would like to perform a variable block size sum of absolute difference calculation on a 2-D array of 16-bit integers in a C++ program, as efficiently as possible. I am interested in real-time block matching code. Are there any software libraries available to do this? The code runs on Windows XP and I'm stuck compiling with Visual Studio 2010. The CPU is a dual-core AMD Athlon 64 X2 4850e.

By variable block size sum of absolute difference (SAD) calculation I mean the following.

I have one smaller 2-D array I will call the template_grid, and one larger 2-D array I will call the image. I want to find the region in the image that minimizes the sum of the absolute difference between the pixels in the template and the pixels in the region in the image.

The simplest way to calculate the SAD in C++ would be the following:

for(int shiftY = 0; shiftY < rangeY; shiftY++) {
    for(int shiftX = 0; shiftX < rangeX; shiftX++) {
        SAD[shiftY][shiftX] = 0;
        for(int x = 0; x < lenTemplateX; x++) {
            for(int y = 0; y < lenTemplateY; y++) {
                SAD[shiftY][shiftX] += abs(template_grid[x][y] - image[y + shiftY][x + shiftX]);
            }
        }
    }
}
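
Once SAD is filled, finding the best match is just a scan for the smallest entry, for example (assuming SAD's element type is wide enough to hold the sums):

long bestSAD = SAD[0][0];
int bestShiftY = 0, bestShiftX = 0;
for (int shiftY = 0; shiftY < rangeY; shiftY++) {
    for (int shiftX = 0; shiftX < rangeX; shiftX++) {
        if (SAD[shiftY][shiftX] < bestSAD) {
            bestSAD = SAD[shiftY][shiftX];
            bestShiftY = shiftY;
            bestShiftX = shiftX;
        }
    }
}
// (bestShiftY, bestShiftX) is the offset of the best-matching region in the image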

The SAD calculation for specific array sizes has been optimized in the Intel Performance Primitives (IPP) library. However, the arrays I'm working with don't match the sizes that library supports.

There are two search ranges I work with:

a large range: rangeY = 45, rangeX = 10

a small range: rangeY = 4, rangeX = 2

There is only one template size and it is: lenTemplateY = 61, lenTemplateX = 7


1 Answer

Minor optimisation:

for(int shiftY = 0; shiftY < rangeY; shiftY++) {
  for(int shiftX = 0; shiftX < rangeX; shiftX++) {
    // if you can assume SAD is already filled with 0-es,
    // you don't need the next line
    SAD[shiftY][shiftX]=0;
    for(int tx = 0, imx=shiftX; tx < lenTemplateX; tx++,imx++) {
      for(int ty = 0, imy=shiftY; ty < lenTemplateY; ty++,imy++) {
        // two increments of imx/imy may be cheaper than
        // two additions with offsets
        SAD[shiftY][shiftX]+=abs(template_grid[tx][ty] - image[imx][imy]);
      }
    }
  }
}
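
A further tweak in the same spirit (just a sketch; a decent compiler may already do this on its own) is to accumulate into a local variable and write SAD[shiftY][shiftX] once per block, instead of re-indexing the 2-D array on every inner iteration:

for(int shiftY = 0; shiftY < rangeY; shiftY++) {
  for(int shiftX = 0; shiftX < rangeX; shiftX++) {
    long acc = 0; // local accumulator, stored to SAD once per block
    for(int tx = 0, imx=shiftX; tx < lenTemplateX; tx++,imx++) {
      for(int ty = 0, imy=shiftY; ty < lenTemplateY; ty++,imy++) {
        acc += abs(template_grid[tx][ty] - image[imx][imy]);
      }
    }
    SAD[shiftY][shiftX] = acc;
  }
}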

Loop unrolling using C++ templates

This may be a crazy idea for your configuration (the C++ compiler is what worries me), but it may work. I offer no warranties, but give it a try.

The idea may work because your template_grid sizes and the ranges are constant, thus known at compile time.
Also, for this to work, your image and template_grid must be organised with the same layout (column first or row first); the sample code in the question mixes the SAD x/y indexing with the template_grid y/x indexing.
In what follows, I'll assume a "column first" organisation, so that SAD[ix] denotes the ix-th column of your SAD matrix. The code works just the same for "row first", except the variable names won't match the meaning of your value arrays.
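
To make that concrete, this is the indexing convention used in the rest of the answer (purely illustrative; ix and iy are placeholder indices, pixel_val_t is the pixel type introduced below):

// "column first" convention:
//   image[ix]     points to the ix-th column (a contiguous run of rows)
//   image[ix][iy] is the pixel at column ix, row iy
// Likewise, SAD[shiftX] is the shiftX-th column of the SAD matrix, indexed by shiftY.
pixel_val_t* col   = image[ix];   // one contiguous image column
pixel_val_t  pixel = col[iy];     // the iy-th element (row) within that column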

So, let's start:

template <
  typename sad_type, typename val_type,
  size_t template_len
> struct sad1D_simple {
  void operator()(
    const val_type* img, const val_type* templ,
    sad_type& result
  ) {
    // template specialization recursion, with one less element to add
    sad1D_simple<sad_type, val_type, template_len-1> one_shorter;
    // call it incrementing the img and template offsets
    one_shorter(img+1, templ+1, result);
    // then add the contribution of the first diff we skipped over above
    result+=abs(*img - *templ);
  }
};

// at a length of 0, the result is zero. We need it to stop the
// template specialization recursion
template <
  typename sad_type, typename val_type
>
struct sad1D_simple<sad_type, val_type, 0> {
  void operator()(
    const val_type* img, const val_type* templ,
    sad_type& result
  ) {
    result=0;
  }
};

Why a functor struct (a struct with an operator())? Because C++ doesn't allow partial specialization of function templates.
What sad1D_simple does: it unrolls the for loop that computes the SAD of two input arrays without any offsetting, based on the fact that the length of your template_grid array is a constant known at compile time. It's in the same vein as "computing a factorial at compile time using C++ templates".
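
For reference, here is that compile-time-factorial idiom in its minimal form (not from the question; shown only to spell out the analogy):

#include <cstddef>

// compile-time factorial via recursive template specialization,
// the same idiom sad1D_simple relies on for its unrolling
template <size_t N>
struct factorial {
  static const size_t value = N * factorial<N - 1>::value;
};

// the full specialization for 0 stops the recursion
template <>
struct factorial<0> {
  static const size_t value = 1;
};

// factorial<5>::value == 120, computed entirely at compile time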

How does this help?
Example of use in the code below:

#include <cstdint>  // for int16_t (available in VS2010)
#include <cstdlib>  // for abs

typedef unsigned long SAD_t;
typedef int16_t pixel_val_t;

const size_t lenTemplateX = 7; // number of cols in the template_grid
const size_t lenTemplateY = 61;
const size_t rangeX=10, rangeY=45;

pixel_val_t **image, **template_grid;
SAD_t** SAD;
// assume those are allocated and initialized somehow
// (SAD zero-filled, since the loops below accumulate into it)


for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
  pixel_val_t* template_col=template_grid[tgrid_col];
  // the X axis - horizontal - is the column axis, right?
  for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
    // the image column that lines up with this template column at this shift
    pixel_val_t* img_col=image[shiftX + tgrid_col];
    for(size_t shiftY = 0; shiftY < rangeY; shiftY++) {
      // the Y axis - vertical - is the "offset in a column"=row, isn't it?
      pixel_val_t* img_col_offsetted=img_col+shiftY;

      // this functor is made by recursive specialization
      // there's no cycle inside it, it was unrolled into
      // lenTemplateY individual subtractions, abs-es and additions
      sad1D_simple<SAD_t, pixel_val_t, lenTemplateY> calc;
      SAD_t col_sad;
      calc(img_col_offsetted, template_col, col_sad);
      // accumulate the per-column SAD across the template columns
      SAD[shiftX][shiftY] += col_sad;
    }
  }
}

Mmmm... can we do better? It won't be X-axis unrolling, since we still want to stay in 1D territory, but maybe we can create a ranged sad1D and unroll one more loop on the same axis.
It will work iff rangeY is also constant, which it is in your case.

template <
  typename sad_type, typename val_type,
  size_t range, size_t template_len
> struct sad1D_ranged {
  void operator()(
    const val_type* img, const val_type* templ,
    // result is assumed to have at least `range` slots
    sad_type* result
  ) {
    // we'll compute here the first slot of the result
    sad1D_simple<sad_type, val_type, template_len>
      calculator_for_first_sad;
    calculator_for_first_sad(img, templ, *(result));

    // now, ask for a recursive specialization for 
    // the next (range-1) sad-s
    sad1D_ranged<sad_type, val_type, range-1, template_len>
       one_less_in_range;
    // when calling, pass the shifted img and result
    one_less_in_range(img+1, templ, result+1);
  }
};

// for a range of 0, there's nothing to do, but we need it
// to stop the template specialization recursion
template <
  typename sad_type, typename val_type,
  size_t template_len
> struct sad1D_ranged<sad_type, val_type, 0, template_len> {
  void operator()(
    const val_type* img, const val_type* templ,
    // result is assumed to have at least `range` slots
    sad_type* result
  ) {
  }
};

And here's how you use it:

for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
  pixel_val_t* template_col=template_grid[tgrid_col];
  for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
    // again, the image column that lines up with this template column
    pixel_val_t* img_col=image[shiftX + tgrid_col];
    SAD_t* sad_col=SAD[shiftX];

    // per-column partial SADs, accumulated into sad_col below
    SAD_t col_sad[rangeY];
    sad1D_ranged<SAD_t, pixel_val_t, rangeY, lenTemplateY> calc;
    calc(img_col, template_col, col_sad);
    for(size_t shiftY = 0; shiftY < rangeY; shiftY++)
      sad_col[shiftY] += col_sad[shiftY];
  }
}

Yes... but the question is: will this improve performance?
The heck if I know. For a small number of iterations per loop and strong data locality (values close to one another, so that they sit in the CPU caches), loop unrolling should improve performance. For a larger number of iterations, you may interfere negatively with the CPU's branch prediction and other mumbo-jumbo-I-know-may-impact-performance-but-I-don't-know-how.
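
If in doubt, measure. A rough timing sketch (this helper is illustrative, not part of the code above; QueryPerformanceCounter is used because VS2010 predates std::chrono):

#include <windows.h>

// times one run of a callable and returns the elapsed seconds
template <typename Fn>
double time_once(Fn fn) {
  LARGE_INTEGER freq, t0, t1;
  QueryPerformanceFrequency(&freq);
  QueryPerformanceCounter(&t0);
  fn();
  QueryPerformanceCounter(&t1);
  return double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
}

// usage idea: wrap the plain-loop SAD and the unrolled SAD in two
// functions, run each a few hundred times over the same data,
// and compare the totals.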

Gut feeling: even if the same unrolling technique would work for the other two loops, using it may well degrade performance: we would need to jump from one contiguous vector (an image column) to another, and the entire image may not fit into the CPU cache.

Note: if your template_grid data is constant as well (or you have a finite set of constant template grids), one could take this one step further and create struct functors with dedicated masks. But I'm out of steam for today.
