Why does storing to and loading from an AVX2 256bit vector have different results in debug and release mode? [duplicate]

When I try to store and load 256bits to and from an AVX2 256bit vector, I'm not receiving expected output in release mode.

use std::arch::x86_64::*;

fn main() {
    let key = [1u64, 2, 3, 4];
    let avxreg = unsafe { _mm256_load_si256(key.as_ptr() as *const __m256i) };
    let mut back_key = [0u64; 4];
    unsafe { _mm256_storeu_si256(back_key.as_mut_ptr() as *mut __m256i, avxreg) };
    println!("back_key: {:?}", back_key);
}

playground

In debug mode:

back_key: [1, 2, 3, 4]

In release mode:

back_key: [1, 2, 0, 0]

Either the back half isn't being loaded or it isn't being stored, and I can't figure out which.

What's weird is that targeting the native CPU works. In release mode with RUSTFLAGS="-C target-cpu=native":

back_key: [1, 2, 3, 4]

I've even tried to rid myself of Clippy errors by forcing alignment to no avail (I'm not sure if the code below is even considered more correct).

use std::arch::x86_64::*;

#[repr(align(256))]
#[derive(Debug)]
struct Key([u64; 4]);

fn main() {
    let key = Key([1u64, 2, 3, 4]);
    let avxreg = unsafe { _mm256_load_si256(&key as *const _ as *const __m256i) };
    let mut back_key = Key([0u64; 4]);
    unsafe { _mm256_storeu_si256((&mut back_key) as *mut _ as *mut __m256i, avxreg) };
    println!("back_key: {:?}", back_key);
}
  1. Why is this happening?
  2. Is there a fix for this specific use case?
  3. Can this fix be generalized for user input (e.g., if I wanted to take a byte slice as user input and do the same procedure)?
Nick Babcock asked Sep 20 '18 21:09

1 Answer

After more thoroughly reading the docs, it became clear that I had to extract the body into another function and force that function to be compiled with AVX2 by annotating it with

#[target_feature(enable = "avx2")]

Or compile the entire program with

RUSTFLAGS="-C target-feature=+avx2" cargo run --release

The first option is better because it guarantees that the SIMD instructions in that function are compiled appropriately; it is then on the caller to verify the CPU has those capabilities before calling, via is_x86_feature_detected!("avx2"). All of this is documented, but it would be amazing if the compiler could warn: "hey, this function uses AVX2 instructions but was not annotated with #[target_feature(enable = "avx2")], and the program was not compiled with AVX2 enabled globally, so calling this function is undefined behavior". That would have saved me a lot of headache!

Since relying on undefined behavior is bad, our initial program on the playground should be written as:

use std::arch::x86_64::*;

fn main() {
    unsafe { run() }
}

#[target_feature(enable = "avx2")]
unsafe fn run() {
    let key = [1u64, 2, 3, 4];
    // A [u64; 4] is only guaranteed 8-byte alignment, so use the unaligned
    // load; `_mm256_load_si256` requires 32-byte alignment.
    let avxreg = _mm256_loadu_si256(key.as_ptr() as *const __m256i);
    let mut back_key = [0u64; 4];
    _mm256_storeu_si256(back_key.as_mut_ptr() as *mut __m256i, avxreg);
    println!("back_key: {:?}", back_key);
}

Some notes:

  1. main cannot be unsafe and thus can't be annotated with target_feature, so it is necessary to extract into another function
  2. This still assumes the x86_64 CPU running the code has avx capabilities, so make sure you check before calling
  3. It's not worth looking into why the debug version gives correct results, as running it under release on my home computer also gives correct results (under certain incantations). Looking at assembly shows that LLVM optimized one way or the other, but it is not particularly insightful.
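Putting notes 1 and 2 together with question 3 from the original post, here is a minimal sketch (the function name and layout are mine, not from the answer) that accepts an arbitrary byte slice, uses the unaligned load/store intrinsics so no alignment guarantee is needed on user input, and gates the call behind a runtime feature check:

```rust
use std::arch::x86_64::*;

/// Copies the first 32 bytes of `input` through an AVX2 register.
/// `_mm256_loadu_si256` tolerates any alignment, so arbitrary user
/// slices are fine as long as they hold at least 32 bytes.
#[target_feature(enable = "avx2")]
unsafe fn round_trip(input: &[u8]) -> [u8; 32] {
    assert!(input.len() >= 32);
    let reg = _mm256_loadu_si256(input.as_ptr() as *const __m256i);
    let mut out = [0u8; 32];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, reg);
    out
}

fn main() {
    let input: Vec<u8> = (0..32).collect();
    // Runtime check: only call the AVX2-compiled function if the CPU
    // actually supports AVX2; calling it otherwise is undefined behavior.
    if is_x86_feature_detected!("avx2") {
        let out = unsafe { round_trip(&input) };
        println!("{:?}", &out[..]);
    } else {
        println!("AVX2 not available on this CPU");
    }
}
```

Because the function itself carries the `#[target_feature]` annotation, this works regardless of which `RUSTFLAGS` the rest of the crate is built with.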
Nick Babcock answered Nov 18 '22 13:11