This is going to be the very first SO Question I'm posting!
std::cout << "Hello mighty StackOverflow!" << std::endl;
I'm trying to optimize a "Block Matching" implementation for a stereo-vision application using Intel's SSE4.2 and/or AVX intrinsics. I'm using "Sum of Absolute Differences" (SAD) to find the best matching block. In my case blockSize will be an odd number, such as 3 or 5. This is a snippet of my C++ code:
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            minS = INT_MAX;
            for (int k = 0; k <= beta; ++k) {
                S = 0;
                for (int l = i; l < i + blockSize; ++l) {
                    for (int m = j; m < j + blockSize; ++m) {
                        // adiff(a,b) === abs(a-b)
                        S += adiff(rImage.at<uchar>(l, m), lImage.at<uchar>(l, m + k));
                    }
                }
                if (S < minS) {
                    minS = S;
                    kStar = k;
                }
            }
            disparity.at<uchar>(i, j) = kStar;
        }
    }
I know that the Streaming SIMD Extensions contain many instructions to facilitate block matching using SAD, such as _mm_mpsadbw_epu8 and _mm_sad_epu8, but they all seem to target blockSize values of 4, 16 or 32 (e.g. this code from Intel). My problem is that in my application blockSize is an odd number, mostly 3 or 5.
I have considered the following starting point:
    r0 = _mm_lddqu_si128((__m128i*)&rImage.at<uchar>(i, j));
    l0 = _mm_lddqu_si128((__m128i*)&lImage.at<uchar>(i, j));
    s0 = _mm_abs_epi8(_mm_sub_epi8(r0, l0));
but from here, I don't know of a means to sum up 3 or 5 consecutive bytes from s0!
I would appreciate any thoughts on this.
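One possible way to sum 3 consecutive bytes, offered as a rough sketch rather than a tuned solution: shift the difference vector right by one and two bytes with _mm_srli_si128, widen each copy to 16 bits, and add them. Note also that _mm_abs_epi8(_mm_sub_epi8(a, b)) treats the bytes as signed, so for pixel values 200 and 10 it would give 66 instead of 190; the sketch uses two saturating subtractions instead. The helper names absdiff_epu8 and sad3_row8 are made up for illustration, and the code assumes at least 16 readable bytes at each pointer.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Absolute difference of 16 unsigned bytes. One of the two saturating
// subtractions is always zero, so OR-ing them yields |a - b| per byte.
static inline __m128i absdiff_epu8(__m128i a, __m128i b) {
    return _mm_or_si128(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a));
}

// Hypothetical helper: for x = 0..7,
//   out[x] = |r[x]-l[x]| + |r[x+1]-l[x+1]| + |r[x+2]-l[x+2]|
// Loads 16 bytes from r and from l, writes 8 16-bit sums to out.
static inline void sad3_row8(const std::uint8_t* r, const std::uint8_t* l,
                             std::uint16_t* out) {
    const __m128i zero = _mm_setzero_si128();
    const __m128i d0 = absdiff_epu8(
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(r)),
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(l)));
    // Byte i of d1 is byte i+1 of d0; byte i of d2 is byte i+2 of d0.
    const __m128i d1 = _mm_srli_si128(d0, 1);
    const __m128i d2 = _mm_srli_si128(d0, 2);
    // Widen the low 8 bytes of each vector to 16 bits so the sums cannot overflow.
    __m128i sum = _mm_unpacklo_epi8(d0, zero);
    sum = _mm_add_epi16(sum, _mm_unpacklo_epi8(d1, zero));
    sum = _mm_add_epi16(sum, _mm_unpacklo_epi8(d2, zero));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sum);
}
```

For blockSize 5 the same idea would need two more shifted terms (by 3 and 4 bytes). Adding the results of blockSize consecutive rows with _mm_add_epi16 would then give eight block SADs at once; even a 5x5 block sums to at most 25 * 255 = 6375, which fits in 16 bits.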
I suspect that if blockSize is as small as 3-5 bytes x 3-5 bytes, you'd get fairly little benefit from using SSE or similar instructions, because you'll spend far too much of the "gain" from doing the math quickly on "swizzling" (moving data from one place to another).
However, looking at the code, it looks like you are processing the same rImage[i, j] multiple times, which I think doesn't make sense.
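To illustrate the point about repeated work, here is a minimal scalar sketch, not the asker's code, that computes the SAD of every block for one fixed disparity k using running sums, so each per-pixel difference is reused across overlapping blocks instead of being recomputed. It assumes plain row-major byte buffers rather than cv::Mat, and the function name sadForDisparity is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: SAD of every blockSize x blockSize block for one disparity k.
// rImg and lImg are row-major images of size rows x cols.
// Returns a row-major array of size (rows - blockSize + 1) x (cols - k - blockSize + 1),
// or an empty vector if no block fits.
std::vector<int> sadForDisparity(const std::vector<std::uint8_t>& rImg,
                                 const std::vector<std::uint8_t>& lImg,
                                 int rows, int cols, int blockSize, int k) {
    const int w = cols - k;               // columns where lImg(l, m + k) exists
    const int outW = w - blockSize + 1;
    const int outH = rows - blockSize + 1;
    if (outW <= 0 || outH <= 0) return {};

    auto diff = [&](int i, int m) {
        return std::abs(int(rImg[i * cols + m]) - int(lImg[i * cols + m + k]));
    };

    // Pass 1: horizontal running sum of blockSize differences in each row.
    std::vector<int> rowSum(static_cast<std::size_t>(rows) * outW);
    for (int i = 0; i < rows; ++i) {
        int s = 0;
        for (int m = 0; m < w; ++m) {
            s += diff(i, m);
            if (m >= blockSize) s -= diff(i, m - blockSize);
            if (m >= blockSize - 1) rowSum[i * outW + (m - blockSize + 1)] = s;
        }
    }

    // Pass 2: vertical running sum of blockSize row sums in each column.
    std::vector<int> sad(static_cast<std::size_t>(outH) * outW);
    for (int j = 0; j < outW; ++j) {
        int s = 0;
        for (int i = 0; i < rows; ++i) {
            s += rowSum[i * outW + j];
            if (i >= blockSize) s -= rowSum[(i - blockSize) * outW + j];
            if (i >= blockSize - 1) sad[(i - blockSize + 1) * outW + j] = s;
        }
    }
    return sad;
}
```

Calling this once per k and keeping the k with the smallest SAD at each position would give the same disparity as the nested loops, with work per pixel that no longer grows with blockSize. Both passes are simple streaming loops, which may also make them easier to vectorize than the per-block version.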