Vectorizing a conditional involving shorts

Q: What is vectorization?

Vectorization is the process of converting a scalar program into a vector program. A vectorized program can apply a single instruction to multiple data elements at once, whereas a scalar program operates on only one pair of operands per instruction.

Tags: c++, performance, algorithm, vectorization, visual-c++

I'm using a compact struct of 2 unsigned shorts indicating a start and end position.
I need to be able to quickly determine if there are any Range objects with a length (difference from start to end) past a threshold value.

I'm going to have a huge quantity of objects each with their own Range array, so it is not feasible to track which Range objects are above the threshold in a list or something. This code is also going to be run very often (many times a second for each array), so it needs to be efficient.

struct Range
{
 unsigned short start;
 unsigned short end;
};

I will always have an array of Range sized 2^n. While I would like to abort as soon as I find something over the threshold, I'm pretty sure it'd be faster to simply OR it all together and check at the end... assuming I can vectorize the loop. Although if I could do an if statement on the chunk of results for each vector, that would be grand.

size_t rangecount = 1 << resolution;
Range* ranges = new Range[rangecount];

...

bool result = false;
for (size_t i = 0; i < rangecount; ++i)
{
 result |= (ranges[i].end - ranges[i].start) > 4;
}
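For comparison, here is a rough sketch of the "if statement on each chunk of results" idea in plain scalar C++. It ORs the comparisons inside a fixed-size chunk and branches only once per chunk. The function name any_longer_than, the chunk size of 8, and the threshold parameter are illustrative choices, not anything from my real code, and it assumes the array size is a multiple of 8.

```cpp
#include <cstddef>

struct Range
{
    unsigned short start;
    unsigned short end;
};

// Returns true if any range is longer than `threshold`.
// Assumption: `count` is a multiple of 8. The array size is always a power
// of two, but sizes below 8 would need a plain scalar loop instead.
bool any_longer_than(const Range* ranges, std::size_t count, int threshold)
{
    const std::size_t kChunk = 8;  // hypothetical chunk size, matches 8 x 16-bit lanes
    for (std::size_t i = 0; i < count; i += kChunk)
    {
        bool chunk_hit = false;
        for (std::size_t j = 0; j < kChunk; ++j)
        {
            // Both operands promote to int, so end < start gives a negative length.
            chunk_hit |= (ranges[i + j].end - ranges[i + j].start) > threshold;
        }
        if (chunk_hit) return true;  // one branch per chunk instead of one per element
    }
    return false;
}
```

Called as any_longer_than(ranges, rangecount, 4), this should give the same answer as the loop above while still stopping early.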

Not surprisingly, the auto-vectorizer gives the 1202 error because my data type isn't 32 or 64 bits wide. I really don't want to double my data size and make each field an unsigned int. So I'm guessing the auto-vectorizer approach is out for this.

Are there vector instructions that can handle 16-bit variables? If there are, how could I use them in C++ to vectorize my loop?


asked Dec 21 '12 16:12

user173342

1 Answer

You are wondering if any value is greater than 4?

Yes, there are SIMD instructions for this. It's unfortunate that the auto-vectorizer isn't able to handle this scenario. Here's a vectorized algorithm:

diff_v = end_v - start_v; // _mm_hsub_epi16 
floor_v = max(4_v, diff_v); // _mm_max_epi16 
if (floor_v != 4_v) return true; // wide scalar comparison

Use _mm_sub_epi16 with a structure of arrays or _mm_hsub_epi16 with an array of structures.
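As a minimal sketch of the structure-of-arrays variant, assuming the starts and ends live in two separate arrays (a hypothetical layout, not the one in the question): the function name any_longer_than_soa is made up for illustration, the count is assumed to be a multiple of 8, and every difference is assumed to fit in a signed 16-bit value, since _mm_cmpgt_epi16 is a signed comparison. It uses a compare plus _mm_movemask_epi8 for the per-vector test rather than the max-based check above.

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical structure-of-arrays layout: starts[i] and ends[i] describe range i.
// Assumptions: count is a multiple of 8, and |end - start| < 32768 for every range.
bool any_longer_than_soa(const unsigned short* starts,
                         const unsigned short* ends,
                         std::size_t count,
                         short threshold)
{
    const __m128i limit = _mm_set1_epi16(threshold);
    for (std::size_t i = 0; i < count; i += 8)
    {
        // Unaligned loads of 8 starts and 8 ends.
        __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(starts + i));
        __m128i e = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ends + i));

        __m128i len  = _mm_sub_epi16(e, s);          // 8 lengths (end - start)
        __m128i over = _mm_cmpgt_epi16(len, limit);  // all-ones lanes where len > threshold

        // Any set bit in the mask means at least one lane exceeded the threshold.
        if (_mm_movemask_epi8(over) != 0) return true;
    }
    return false;
}
```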

Actually since start is stored first in memory, you will be working on start_v - end_v, so use _mm_min_epi16 and a vector of -4.

Each of these 128-bit instructions works on 8 16-bit values at a time (_mm_hsub_epi16 requires SSSE3; _mm_sub_epi16, _mm_min_epi16 and _mm_max_epi16 are SSE2). It will still be fastest to return early rather than always scanning the whole array. However, unrolling the loop a bit more may buy you additional speed (pass the first set of results into the packed min/max function to combine them).

So you end up with (approximately):

most_negative = threshold = _mm_set1_epi16(-4); // -4 in every 16-bit lane

loop:
    a = load 4 Ranges from range;
    b = load the next 4 Ranges from range;
    diff = _mm_hsub_epi16(a, b); // 8 values of start - end
    most_negative = _mm_min_epi16(most_negative, diff);

    // unroll by repeating the above four instructions 4 times or so
    if (most_negative != threshold) return true;
repeat loop
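Fleshed out, the loop above might look roughly like the following sketch for the array-of-structures layout. The function name any_longer_than_ssse3 and the SSSE3_FN macro are made up for illustration. It assumes the count is a multiple of 8 and that every difference fits in a signed 16-bit value (the min and compare are signed), and it checks after every pair of loads rather than unrolling further.

```cpp
#include <cstddef>
#include <tmmintrin.h>  // SSSE3: _mm_hsub_epi16 (also pulls in the SSE2 intrinsics)

// GCC and Clang need SSSE3 enabled for this function (or -mssse3 globally);
// MSVC needs nothing extra. SSSE3_FN is a hypothetical helper macro.
#if defined(__GNUC__) || defined(__clang__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

struct Range
{
    unsigned short start;
    unsigned short end;
};

// Assumptions: count is a multiple of 8, and |end - start| < 32768 for every range.
SSSE3_FN bool any_longer_than_ssse3(const Range* ranges, std::size_t count, short threshold)
{
    // start - end < -threshold  is the same test as  end - start > threshold.
    const __m128i limit = _mm_set1_epi16(static_cast<short>(-threshold));
    const __m128i* p = reinterpret_cast<const __m128i*>(ranges);

    for (std::size_t i = 0; i < count; i += 8, p += 2)
    {
        __m128i a = _mm_loadu_si128(p);       // ranges i   .. i+3
        __m128i b = _mm_loadu_si128(p + 1);   // ranges i+4 .. i+7
        __m128i diff = _mm_hsub_epi16(a, b);  // 8 values of start - end

        // Lanes that drop below the limit no longer compare equal to it.
        __m128i most_negative = _mm_min_epi16(limit, diff);
        __m128i unchanged = _mm_cmpeq_epi16(most_negative, limit);
        if (_mm_movemask_epi8(unchanged) != 0xFFFF) return true;
    }
    return false;
}
```

With the question's data this would be called as any_longer_than_ssse3(ranges, rangecount, 4). If lengths can reach 32768 or more, the signed min is not enough and a different comparison would be needed.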


answered Oct 06 '22 07:10

Ben Voigt
