I would like to perform a variable block size sum of absolute difference calculation with a 2-D array of 16 bit integers in a C++ program as efficiently as possible. I am interested in a real time block matching code. I was wondering if there were any software libraries available to do this? The code is running on windows XP and I'm stuck using Visual Studio 2010 to do the compiling. The CPU is a 2-core AMD Athlon 64 x2 4850e.
By variable block size sum of absolute difference(SAD) calculation I mean the following.
I have one smaller 2-D array I will call the template_grid
, and one larger 2-D array I will call the image
. I want to find the region in the image that minimizes the sum of the absolute difference between the pixels in the template and the pixels in the region in the image.
The simplest way to calculate the SAD in C++ if would be the following:
for(int shiftY = 0; shiftY < rangeY; shiftY++) {
for(int shiftX = 0; shiftX < rangeX; shiftX++) {
for(int x = 0; x < lenTemplateX; x++) {
for(int y = 0; y < lenTemplateY; y++) {
SAD[shiftY][shiftX]=abs(template_grid[x][y] - image[y + shiftY][x + shiftX]);
}
}
}
}
The SAD calculation for specific array sizes has been optimized in the Intel performance primitives library. However, the arrays I'm working with don't fit the sizes in these libraries.
There are two search ranges I work with,
a large range: rangeY = 45, rangeX = 10
a small range: rangeY = 4, rangeX = 2
There is only one template size and it is: lenTemplateY = 61, lenTemplateX = 7
Minor optimisation:
for(int shiftY = 0; shiftY < rangeY; shiftY++) {
for(int shiftX = 0; shiftX < rangeX; shiftX++) {
// if you can assume SAD is already filled with 0-es,
// you don't need the next line
SAD[shiftX][shiftY]=0;
for(int tx = 0, imx=shiftX; x < lenTemplateX; tx++,imx++) {
for(int ty = 0, imy=shiftY; y < lenTemplateY; ty++,imy++) {
// two increments of imx/imy may be cheaper than
// two addition with offsets
SAD[shiftY][shiftX]+=abs(template_grid[tx][ty] - image[imx][imy]);
}
}
}
}
May be a crazy idea for your configuration (C++ compiler worries me), but it may work. I offer no warranties, but give it a try.
The idea may work because your template_grid
sizes and the ranges are constant - thus known at compilation time.
Also, for this to work, your image
and template_grid
must be organised with the same layout (column first or row first) - the way your "sample code" is depicted in the question mixes the SAD x/y
with template_grid y/x
.
In the followings, I'll assume a "column first" organisation, so that SAD[ix]
denotes the ix
th column of your SAD**
matrix. The code goes just the same for "row first", except the name of the variables won't match the meaning of your value arrays.
So, let's start:
template <
typename sad_type, typename val_type,
size_t template_len
> struct sad1D_simple {
void operator()(
const val_type* img, const val_type* templ,
sad_type& result
) {
// template specialization recursion, with one less element to add
sad1D_simple<sad_type, val_type, template_len-1> one_shorter;
// call it incrementing the img and template offsets
one_shorter(img+1, templ+1, result);
// the add the contribution of the first diff we skipped over above
result+=abs(*(img+template_len-1)-*(templ+template_len-1));
}
};
// at len of 0, the result is zero. We need it to stop the
template <
typename sad_type, typename val_type
>
struct sad1D_simple<sad_type, val_type, 0> {
void operator()(
const val_type* img, const val_type* templ,
sad_type& result
) {
result=0;
}
};
Why a functor struct - struct with operator? The C++ doesn't allow partial specialization of function templates.
What the sad1D_simple
does: unrolls a for
cycle that computes the SAD
of two arrays in input without any offsetting, based on the fact that the length of your template_grid
array is a constant known at compile time. It's in the same vein as "computing the factorial of compile time using C++ templates"
How this helps?
Example of use in the code below:
typedef ulong SAD_t;
typedef int16_t pixel_val_t;
const size_t lenTemplateX = 7; // number of cols in the template_grid
const size_t lenTemplateY = 61;
const size_t rangeX=10, rangeY=45;
pixel_val_t **image, **template_grid;
SAD_t** SAD;
// assume those are initialized somehow
for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
pixel_val_t* template_col=template_grid[tgrid_col];
// the X axis - horizontal - is the column axis, right?
for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
pixel_val_t* img_col=image[shiftX];
for(size_t shiftY = 0; shiftY < rangeY; shiftY++) {
// the Y axis - vertical - is the "offset in a column"=row, isn't it?
pixel_val_t* img_col_offsetted=img_col+shiftY;
// this functor is made by recursive specialization
// there's no cycle inside it, it was unrolled into
// lenTemplateY individual subtractions, abs-es and additions
sad1D_simple<SAD_t, pixel_val_t, lenTemplateY> calc;
calc(img_col_offsetted, template_col, SAD[shiftX][shiftY]);
}
}
}
Mmmm... can we do better? No, it won't be the X-axis unrolling, we still want to stay in 1D area, but... well, maybe if we create a ranged sad1D
and unroll one more loop on the same axis?
It will work iff the rangeX
is also constant.
template <
typename sad_type, typename val_type,
size_t range, size_t template_len
> struct sad1D_ranged {
void operator()(
const val_type* img, const val_type* templ,
// result is assumed to have at least `range` slots
sad_type* result
) {
// we'll compute here the first slot of the result
sad1D_simple<sad_type, val_type, template_len>
calculator_for_first_sad;
calculator_for_first_sad(img, templ, *(result));
// now, ask for a recursive specialization for
// the next (range-1) sad-s
sad1D_ranged<sad_type, val_type, range-1, template_len>
one_less_in_range;
// when calling, pass the shifted img and result
one_less_in_range(img+1, templ, result+1);
}
};
// for a range of 0, there's nothing to do, but we need it
// to stop the template specialization recursion
template <
typename sad_type, typename val_type,
size_t template_len
> struct sad1D_ranged<sad_type, val_type, 0, template_len> {
void operator()(
const val_type* img, const val_type* templ,
// result is assumed to have at least `range` slots
sad_type* result
) {
}
};
And here's how you use it:
for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
pixel_val_t* template_col=template_grid[tgrid_col];
for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
pixel_val_t* img_col=image[shiftX];
SAD_t* sad_col=SAD[shiftX];
sad1D_ranged<SAD_t, pixel_val_t, rangeY, lenTemplateY> calc;
calc(img_col, template_col, sad_col);
}
}
Yes... but the question is: will this improve performance?
The heck if I know. For small number of loops within a cycle and for strong data locality (values close one to the other so that they are in the CPU caches), loop unrolling should improve the performance. For a larger number of loops, you may negatively interfere with the CPU branch prediction and other mumbo-jumbo-I-know-may-impact-performance-but-I-don't-know-how.
Feeling of guts: even if the same unrolling technique may work for the other two loops, using it may well result in a degradation of performance: we'll need to jump from one contiguous vector (an image
column) to the other - the entire image may not fit into the CPU cache.
Note: if your template_grid
data is constant as well (or you have a finite set of constant template grids), one may take one step further and create struct functors with dedicated masks. But I'm out of steam for today.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With