Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SIMD code works in Debug, but does not in Release

Tags:

rust

simd

This code works in debug mode, but panics because of the assert in release mode.

use std::arch::x86_64::*;

fn main() {
    unsafe {
        let a = vec![2.0f32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0];
        let b = -1.0f32;

        let ar = _mm256_loadu_ps(a.as_ptr());
        println!("ar: {:?}", ar);

        let br = _mm256_set1_ps(b);
        println!("br: {:?}", br);

        let mut abr = _mm256_setzero_ps();
        println!("abr: {:?}", abr);

        abr = _mm256_fmadd_ps(ar, br, abr);
        println!("abr: {:?}", abr);

        let mut ab = [0.0; 8];
        _mm256_storeu_ps(ab.as_mut_ptr(), abr);
        println!("ab: {:?}", ab);

        assert_eq!(ab[0], -2.0f32);
    }
}

(Playground)

like image 409
kali Avatar asked Feb 04 '23 20:02

kali


1 Answers

I can indeed confirm that this code causes the assert to trip in release mode:

$ cargo run --release
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/so53831502`
ar: __m256(2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
br: __m256(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0)
abr: __m256(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
abr: __m256(-1.0, -1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0)
ab: [-1.0, -1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `-1.0`,
 right: `-2.0`', src/main.rs:24:9

This appears to be a compiler bug, see here and here. In particular, you are calling routines like _mm256_set1_ps and _mm256_fmadd_ps, which require the CPU features avx and fma respectively, but neither your code nor your compilation command indicate to the compiler that such features should be used.

One way of fixing this is to tell the compiler to compile the entire program with both the avx and fma features enabled, like so:

$ RUSTFLAGS="-C target-feature=+avx,+fma" cargo run --release
   Compiling so53831502 v0.1.0 (/tmp/so53831502)
    Finished release [optimized] target(s) in 0.36s
     Running `target/release/so53831502`
ar: __m256(2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
br: __m256(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0)
abr: __m256(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
abr: __m256(-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
ab: [-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Another approach that achieves the same result is to tell the compiler to use all available CPU features on your CPU:

$ RUSTFLAGS="-C target-cpu=native" cargo run --release
   Compiling so53831502 v0.1.0 (/tmp/so53831502)
    Finished release [optimized] target(s) in 0.34s
     Running `target/release/so53831502`
ar: __m256(2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
br: __m256(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0)
abr: __m256(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
abr: __m256(-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
ab: [-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

However, both of these compilation commands produce binaries that can only run on CPUs that support the avx and fma features. If that's not a problem for you, then this is a fine solution. If you would instead like to build portable binaries, then you can perform CPU feature detection at runtime, and compile certain functions with specific CPU features enabled. It is then your responsibility to guarantee that said functions are only invoked when the corresponding CPU feature is enabled and available. This process is documented as part of the dynamic CPU feature detection section of the std::arch docs.

Here's an example that uses runtime CPU feature detection:

use std::arch::x86_64::*;
use std::process;

fn main() {
    if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
        // SAFETY: This is safe because we're guaranteed to support the
        // necessary CPU features.
        unsafe { doit(); }
    } else {
        eprintln!("unsupported CPU");
        process::exit(1);
    }
}

#[target_feature(enable = "avx,fma")]
unsafe fn doit() {
    let a = vec![2.0f32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0];
    let b = -1.0f32;

    let ar = _mm256_loadu_ps(a.as_ptr());
    println!("ar: {:?}", ar);

    let br = _mm256_set1_ps(b);
    println!("br: {:?}", br);

    let mut abr = _mm256_setzero_ps();
    println!("abr: {:?}", abr);

    abr = _mm256_fmadd_ps(ar, br, abr);
    println!("abr: {:?}", abr);

    let mut ab = [0.0; 8];
    _mm256_storeu_ps(ab.as_mut_ptr(), abr);
    println!("ab: {:?}", ab);

    assert_eq!(ab[0], -2.0f32);
}

To run it, you no longer need to set any compilation flags:

$ cargo run --release
   Compiling so53831502 v0.1.0 (/tmp/so53831502)
    Finished release [optimized] target(s) in 0.29s
     Running `target/release/so53831502`
ar: __m256(2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
br: __m256(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0)
abr: __m256(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
abr: __m256(-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
ab: [-2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

If you run the resulting binary on a CPU that doesn't support either avx or fma, then the program should exit with an error message: unsupported CPU.

In general, I think the docs for std::arch could be improved. In particular, the key boundary at which you need to split your code is dependent upon whether your vector types appear in your function signature. That is, the doit routine does not require anything beyond the standard x86 (or x86_64) function ABI to call, and is thus safe to call from functions that don't otherwise support avx or fma. However, internally, the function has been told to compile its code using additional instruction set extensions based on the given CPU features. This is achieved via the target_feature attribute. If you, for example, supplied an incorrect target feature:

#[target_feature(enable = "ssse3")]
unsafe fn doit() {
    // ...
}

then the program exhibits the same behavior as your initial program.

like image 133
BurntSushi5 Avatar answered Feb 06 '23 15:02

BurntSushi5