Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Auto vectorization with Rust

I am new with rust/SIMD and I have the following code snippet as the bottleneck of my program, I am wondering if I can leverage on the autovectorization feature of it

fn is_subset(a: Vec<i64>, b: Vec<i64>) -> bool {
    for i in 0..a.len() {
        if (a[i] & !b[i]) != 0 {
            return false;
        }
    }
    true
}

I also have an alternative way for writing this (with iterator so trip count will be known up front), would this create autovectorization instead?

fn is_subset(a: Vec<i64>, b: Vec<i64>) -> bool {
    return a.iter().zip(b.iter()).all(|(x, y)| x & y == *x)
}

like image 614
xxx222 Avatar asked Oct 23 '25 17:10

xxx222


1 Answers

LLVM (and GCC) don't know how to auto-vectorize loops whose trip-count can't be calculated up front. This rules out search loops like this.

ICC classic can auto-vectorize such loops, but it's a C/C++ compiler, without a front-end for Rust.

Probably your only hope would be to manually loop over 2, 4, or 8-element chunks of the arrays, branchlessly calculating your condition based on all those elements. If you're lucky, LLVM might turn that into operations on one SIMD vector. So using that inner loop inside a larger loop could result in getting the compiler to make vectorized asm, for example using AVX vptest (which sets CF according to bitwise a AND (not b) having any non-zero bits).

i.e. manually express the "unrolling" of SIMD elements in your source, for a specific vector width.

Related re: getting compilers to auto-vectorize with x86 ptest:

  • Getting GCC to generate a PTEST instruction when using vector extensions
  • how to auto vectorization array comparison function
  • GCC C vector extension: How to check if result of ANY element-wise comparison is true, and which?

If the whole array is small enough, a branchless reduction over the whole array (ORing boolean results together) could be something a compiler is willing to do something with. If compiling for x86, you'd be hoping for asm like pandn / por in the loop, and a horizontal reduction at the end, so see if any bits were set in the intersection of a and not b.

i64 is only 2 elements per 16-byte vector, so the compiler would have to see a good strategy for it to be profitable to auto-vectorize, especially on a 64-bit machine. Having 32-byte vectors available makes it more attractive.

like image 167
Peter Cordes Avatar answered Oct 26 '25 09:10

Peter Cordes



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!