Block Matching optimization using x86/x64 Streaming SIMD Extension

This is going to be the very first SO Question I'm posting!

    std::cout << "Hello mighty StackOverflow!" << std::endl;

I'm trying to optimize a "Block Matching" implementation for a stereo-vision application using Intel's SSE4.2 and/or AVX intrinsics. I'm using the "Sum of Absolute Differences" (SAD) to find the best matching block. In my case blockSize will be an odd number, such as 3 or 5. This is a snippet of my C++ code:

    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            minS = INT_MAX;
            for (int k = 0; k <= beta; ++k) {
                S = 0;
                for (int l = i; l < i + blockSize; ++l) {
                    for (int m = j; m < j + blockSize; ++m) {
                        // adiff(a,b) === abs(a-b)
                        S += adiff(rImage.at<uchar>(l, m), lImage.at<uchar>(l, m + k));
                    }
                }
                if (S < minS) {
                    minS = S;
                    kStar = k;
                }
            }
            disparity.at<uchar>(i, j) = kStar;
        }
    }

I know that the Streaming SIMD Extensions contain several instructions that facilitate block matching using SAD, such as _mm_mpsadbw_epu8 and _mm_sad_epu8, but they all seem to target block sizes of 4, 16 or 32 (e.g. this code from Intel). My problem is that in my application blockSize is an odd number, mostly 3 or 5.
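
As a minimal sketch of how `_mm_sad_epu8` could still serve an odd width: mask off the bytes beyond the block before the SAD, so that only the first 3 or 5 columns contribute. The helper name `sadRowMasked` and the mask table are assumptions made for illustration. Only SSE2 intrinsics are used, and the function always reads 8 bytes from each pointer, so the caller has to guarantee those bytes are readable.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Hypothetical helper: SAD over the first `width` bytes (1..8) of two pixel rows.
// Note: always loads 8 bytes from r and l, even when width < 8.
inline int sadRowMasked(const std::uint8_t* r, const std::uint8_t* l, int width) {
    assert(width >= 1 && width <= 8);
    // An 8-byte load starting at (8 - width) yields `width` 0xFF bytes, then zeros.
    static const std::uint8_t maskTable[16] = {
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0, 0, 0, 0, 0, 0, 0, 0};
    const __m128i mask =
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(maskTable + 8 - width));
    const __m128i vr = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(r));
    const __m128i vl = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(l));
    // Masked-out bytes are zero in both operands, so they add nothing to the sum.
    const __m128i sad =
        _mm_sad_epu8(_mm_and_si128(vr, mask), _mm_and_si128(vl, mask));
    return _mm_cvtsi128_si32(sad);  // low 64-bit lane holds the sum of the low 8 bytes
}
```

For a 3x3 block, calling this on three consecutive rows and adding the results gives S for one candidate k. It produces a single block sum per call, so whether it actually beats the scalar loop would need measuring.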

I have considered the following starting point:

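            r0 = _mm_lddqu_si128 ((__m128i*)&rImage.at<uchar>(i, j));
            l0 = _mm_lddqu_si128 ((__m128i*)&lImage.at<uchar>(i, j));
            // _mm_abs_epi8 (_mm_sub_epi8 (r0, l0)) wraps for differences above 127;
            // two saturating subtractions give the unsigned |r0 - l0| per byte
            s0 = _mm_or_si128 (_mm_subs_epu8 (r0, l0), _mm_subs_epu8 (l0, r0));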

but from here, I don't know of a means to sum up 3 or 5 consecutive bytes from s0!
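
One possible way to sum 3 or 5 consecutive bytes of a register of absolute differences, sketched with SSE2 only: shift the register by 0, 1, 2 (and 3, 4) bytes, widen each shifted copy to 16 bits, and add them. Lane i of the result then holds the sum of bytes i..i+2 (or i..i+4), giving window sums for 8 neighbouring pixels at once. The names `absDiffU8`, `widenFrom`, `windowSums3`, `windowSums5` and `sadBlock8` are hypothetical, and each row is assumed to have 16 readable bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Unsigned per-byte |a - b| via two saturating subtractions.
inline __m128i absDiffU8(__m128i a, __m128i b) {
    return _mm_or_si128(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a));
}

// Bytes N..N+7 of s, zero-extended into eight 16-bit lanes.
template <int N>
inline __m128i widenFrom(__m128i s) {
    return _mm_unpacklo_epi8(_mm_srli_si128(s, N), _mm_setzero_si128());
}

// Lane i = s[i] + s[i+1] + s[i+2], for i = 0..7.
inline __m128i windowSums3(__m128i s) {
    return _mm_add_epi16(_mm_add_epi16(widenFrom<0>(s), widenFrom<1>(s)),
                         widenFrom<2>(s));
}

// Lane i = s[i] + ... + s[i+4], for i = 0..7.
inline __m128i windowSums5(__m128i s) {
    return _mm_add_epi16(windowSums3(s),
                         _mm_add_epi16(widenFrom<3>(s), widenFrom<4>(s)));
}

// SAD of the blockSize x blockSize blocks whose top-left corners are the
// 8 consecutive pixels starting at r (right image) and l (left image, already
// offset by the candidate disparity). stride is the row pitch in bytes.
inline void sadBlock8(const std::uint8_t* r, const std::uint8_t* l,
                      std::ptrdiff_t stride, int blockSize, std::uint16_t out[8]) {
    assert(blockSize == 3 || blockSize == 5);
    __m128i acc = _mm_setzero_si128();
    for (int row = 0; row < blockSize; ++row) {
        const __m128i vr =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(r + row * stride));
        const __m128i vl =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(l + row * stride));
        const __m128i d = absDiffU8(vr, vl);
        // Rows are combined with plain vertical adds, no shuffling needed.
        acc = _mm_add_epi16(acc, blockSize == 3 ? windowSums3(d) : windowSums5(d));
    }
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), acc);
}
```

The 16-bit lanes cannot overflow here: a 5x5 block sums to at most 25 * 255 = 6375. Calling `sadBlock8(&rImage.at<uchar>(i, j), &lImage.at<uchar>(i, j + k), stride, blockSize, out)` for each k would give S for pixels j..j+7 in one pass, assuming the reads stay inside the image.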

I would appreciate any thoughts on this.

I suspect that with a blockSize as small as 3 to 5 (that is, a block of 3x3 or 5x5 bytes), you'd get fairly little benefit from using SSE or similar instructions, because you'd spend far too much of the "gain" from doing the math quickly on "swizzling" (moving data from one place to another).

However, looking at the code, it looks like you are processing the same rImage[i, j] multiple times, which I think doesn't make sense.
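
One way to act on that observation, sketched in plain scalar C++ under stated assumptions: neighbouring windows overlap, so for each candidate k the per-column sums of absolute differences can be computed once and then reused by sliding the window horizontally. The `Image` struct is a hypothetical stand-in for the `cv::Mat` in the question, `disparityByColumnSums` is an illustrative name, and the output is restricted to positions where every read stays inside both images.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical row-major 8-bit image standing in for cv::Mat.
struct Image {
    int rows, cols;
    std::vector<std::uint8_t> px;
    std::uint8_t at(int r, int c) const { return px[r * cols + c]; }
};

// Returns an outRows x outCols disparity map (row-major), where
// outRows = rows - blockSize + 1 and outCols = cols - blockSize + 1 - beta.
std::vector<std::uint8_t> disparityByColumnSums(const Image& R, const Image& L,
                                                int blockSize, int beta) {
    const int outRows = R.rows - blockSize + 1;
    const int usable = R.cols - beta;            // columns readable at every k
    const int outCols = usable - blockSize + 1;
    assert(outRows > 0 && outCols > 0);
    std::vector<int> best(outRows * outCols, INT_MAX);
    std::vector<std::uint8_t> disp(outRows * outCols, 0);
    std::vector<int> colSum(usable);

    for (int k = 0; k <= beta; ++k) {
        for (int i = 0; i < outRows; ++i) {
            // colSum[c]: absolute differences of column c over rows i..i+blockSize-1.
            for (int c = 0; c < usable; ++c) {
                int s = 0;
                for (int l = i; l < i + blockSize; ++l)
                    s += std::abs(int(R.at(l, c)) - int(L.at(l, c + k)));
                colSum[c] = s;
            }
            // Slide the window: add the entering column, drop the leaving one.
            int S = 0;
            for (int c = 0; c < blockSize; ++c) S += colSum[c];
            for (int j = 0; j < outCols; ++j) {
                if (S < best[i * outCols + j]) {
                    best[i * outCols + j] = S;
                    disp[i * outCols + j] = static_cast<std::uint8_t>(k);
                }
                if (j + 1 < outCols) S += colSum[j + blockSize] - colSum[j];
            }
        }
    }
    return disp;
}
```

Each pixel difference is now computed once per (row window, k) instead of once per window position. The column-sum loop is also a plain vertical accumulation, which is the kind of access pattern that tends to vectorize without swizzling.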

Tags: c++, c, optimization, simd, sse

ɹɐʎɯɐʞ

1 Answers

Mats Petersson
