Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

OpenCV Sum of squared differences speed

I've been using the openCV to do some block matching and I've noticed it's sum of squared differences code is very fast compared to a straight forward for loop like this:

int SSD = 0;
for(int i =0; i < arraySize; i++)
    SSD += (array1[i] - array2[i] )*(array1[i] - array2[i]);

If I look at the source code to see where the heavy lifting happens, the OpenCV folks have their for loops do 4 squared difference calculations at a time in each iteration of the loop. The function to do the block matching looks like this.

int64
icvCmpBlocksL2_8u_C1( const uchar * vec1, const uchar * vec2, int len )
{
int i, s = 0;
int64 sum = 0;

for( i = 0; i <= len - 4; i += 4 ) 
{   
    int v = vec1[i] - vec2[i];
    int e = v * v;

    v = vec1[i + 1] - vec2[i + 1]; 
    e += v * v;
    v = vec1[i + 2] - vec2[i + 2];
    e += v * v;
    v = vec1[i + 3] - vec2[i + 3];
    e += v * v;
    sum += e;
}

for( ; i < len; i++ )
{
    int v = vec1[i] - vec2[i];

    s += v * v;
}

return sum + s;
}

This calculation is for unsigned 8 bit integers. They perform a similar calculation for 32-bit floats in this function:

double
icvCmpBlocksL2_32f_C1( const float *vec1, const float *vec2, int len )
{
double sum = 0;
int i;

for( i = 0; i <= len - 4; i += 4 )
{
    double v0 = vec1[i] - vec2[i];
    double v1 = vec1[i + 1] - vec2[i + 1];
    double v2 = vec1[i + 2] - vec2[i + 2];
    double v3 = vec1[i + 3] - vec2[i + 3];

    sum += v0 * v0 + v1 * v1 + v2 * v2 + v3 * v3;
}
for( ; i < len; i++ )
{
    double v = vec1[i] - vec2[i];

    sum += v * v;
}
return sum;
}

I was wondering if anyone had any idea if breaking a loop up into chunks of 4 like this might speed up code? I should add that there is no multithreading occuring in this code.

like image 928
ncRubert Avatar asked Apr 14 '12 17:04

ncRubert


2 Answers

My guess is that this is just a simple implementation of unrolling the loop - it saves 3 additions and 3 compares on each pass of the loop, which can be a great savings if, for example, checking len involves a cache miss. The downside is that this optimization adds code complexity (e.g. the additional for loop at the end to finish the loop for the len % 4 items left if the length is not evenly divisible by 4) and, of course, it's an architecture-dependent optimization whose magnitude of improvement will vary by hardware/compiler/etc...

Still, it's straightforward to follow compared to most optimizations and will probably result in some sort of performance increase regardless of the architecture, so it's low risk to just throw it in there and hope for the best. Since OpenCV is such a well-supported chunk of code, I'm sure that someone instrumented these chunks of code and found them to be well worth it - as you yourself have done.

like image 184
Matt Avatar answered Nov 14 '22 13:11

Matt


There is one obvious optimisation of your code, viz:

int SSD = 0;
for(int i = 0; i < arraySize; i++)
{
    int v = array1[i] - array2[i];
    SSD += v * v;
}
like image 33
Philip Sheard Avatar answered Nov 14 '22 13:11

Philip Sheard