I've been using OpenCV to do some block matching, and I've noticed that its sum-of-squared-differences (SSD) code is very fast compared to a straightforward for loop like this:
int SSD = 0;
for(int i = 0; i < arraySize; i++)
    SSD += (array1[i] - array2[i]) * (array1[i] - array2[i]);
If I look at the source code to see where the heavy lifting happens, the OpenCV folks have their for loops do four squared-difference calculations per iteration. The function that does the block matching looks like this:
int64
icvCmpBlocksL2_8u_C1( const uchar * vec1, const uchar * vec2, int len )
{
    int i, s = 0;
    int64 sum = 0;

    for( i = 0; i <= len - 4; i += 4 )
    {
        int v = vec1[i] - vec2[i];
        int e = v * v;

        v = vec1[i + 1] - vec2[i + 1];
        e += v * v;
        v = vec1[i + 2] - vec2[i + 2];
        e += v * v;
        v = vec1[i + 3] - vec2[i + 3];
        e += v * v;
        sum += e;
    }
    for( ; i < len; i++ )
    {
        int v = vec1[i] - vec2[i];
        s += v * v;
    }
    return sum + s;
}
This calculation is for unsigned 8-bit integers. They perform a similar calculation for 32-bit floats in this function:
double
icvCmpBlocksL2_32f_C1( const float *vec1, const float *vec2, int len )
{
    double sum = 0;
    int i;

    for( i = 0; i <= len - 4; i += 4 )
    {
        double v0 = vec1[i] - vec2[i];
        double v1 = vec1[i + 1] - vec2[i + 1];
        double v2 = vec1[i + 2] - vec2[i + 2];
        double v3 = vec1[i + 3] - vec2[i + 3];

        sum += v0 * v0 + v1 * v1 + v2 * v2 + v3 * v3;
    }
    for( ; i < len; i++ )
    {
        double v = vec1[i] - vec2[i];
        sum += v * v;
    }
    return sum;
}
I was wondering whether breaking a loop up into chunks of 4 like this actually speeds up code. I should add that there is no multithreading occurring in this code.
My guess is that this is just a simple implementation of loop unrolling: it saves 3 loop-counter increments and 3 compares for every four elements processed, which can be a significant saving if, for example, checking len involves a cache miss. The downside is that this optimization adds code complexity (e.g. the additional for loop at the end that handles the len % 4 items left over when the length is not evenly divisible by 4) and, of course, it's an architecture-dependent optimization whose magnitude of improvement will vary by hardware, compiler, etc.
Still, it's straightforward to follow compared to most optimizations and will probably result in some sort of performance increase regardless of the architecture, so it's low risk to just throw it in there and hope for the best. Since OpenCV is such a well-supported codebase, I'm sure someone instrumented these routines and found them to be well worth it - as you yourself have done.
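If you want to see how much the unrolling buys you on your own machine, a small timing harness is enough. The following is only a sketch, not OpenCV code: the function names, array size, and repetition count are made up for illustration, and the outcome will depend heavily on compiler flags and CPU.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Plain SSD loop, one squared difference per iteration.
static int64_t ssd_plain(const unsigned char *a, const unsigned char *b, int len)
{
    int64_t sum = 0;
    for (int i = 0; i < len; i++)
    {
        int v = a[i] - b[i];
        sum += v * v;
    }
    return sum;
}

// OpenCV-style loop, unrolled by four, with a tail loop for the leftover elements.
static int64_t ssd_unrolled(const unsigned char *a, const unsigned char *b, int len)
{
    int64_t sum = 0;
    int i;
    for (i = 0; i <= len - 4; i += 4)
    {
        int v0 = a[i]     - b[i];
        int v1 = a[i + 1] - b[i + 1];
        int v2 = a[i + 2] - b[i + 2];
        int v3 = a[i + 3] - b[i + 3];
        sum += v0 * v0 + v1 * v1 + v2 * v2 + v3 * v3;
    }
    for (; i < len; i++)
    {
        int v = a[i] - b[i];
        sum += v * v;
    }
    return sum;
}

int main()
{
    const int len = 1 << 20;                  // array size chosen purely for illustration
    std::vector<unsigned char> a(len), b(len);
    for (int i = 0; i < len; i++) { a[i] = i & 0xFF; b[i] = (i * 7) & 0xFF; }

    const int reps = 200;
    int64_t checksum = 0;                     // consumed below so the work isn't optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) checksum += ssd_plain(a.data(), b.data(), len);
    auto t1 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) checksum += ssd_unrolled(a.data(), b.data(), len);
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("plain:    %lld ms\n", (long long)std::chrono::duration_cast<ms>(t1 - t0).count());
    std::printf("unrolled: %lld ms\n", (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
    std::printf("checksum: %lld\n", (long long)checksum);
    return 0;
}

Comparing the two at different optimization levels is also instructive: a modern compiler at -O2 or -O3 will often unroll or vectorize the plain loop by itself, which can shrink or erase the gap.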
There is one obvious optimisation of your code, viz:
int SSD = 0;
for(int i = 0; i < arraySize; i++)
{
    int v = array1[i] - array2[i];
    SSD += v * v;
}
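One aside that isn't part of this optimisation, but is worth noting from the OpenCV code quoted above: both OpenCV functions accumulate into a wider type (int64 and double). If arraySize can get large, a plain int accumulator can overflow, since each squared difference of 8-bit values can be as big as 255 * 255 = 65025. A minimal sketch with a 64-bit accumulator (the function name is just illustrative):

#include <cstdint>

int64_t ssd64(const unsigned char *array1, const unsigned char *array2, int arraySize)
{
    int64_t SSD = 0;
    for (int i = 0; i < arraySize; i++)
    {
        int v = array1[i] - array2[i];        // difference of two 8-bit values fits easily in an int
        SSD += static_cast<int64_t>(v) * v;   // widen before accumulating, like the int64 sum in OpenCV
    }
    return SSD;
}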