I have a 32-byte aligned structure holding three vectors of 8 floats each:
struct ALIGN(32) Ray8
{
float x[8];
float y[8];
float z[8];
};
When using AVX2, I want to operate on these members in unison. When do I need to explicitly load them using _mm256_load_ps() instead of casting them? For example, using the following signature:
void GenerateRayDirections( __m256 * x, __m256 * y, __m256 * z ) { ... }
Invoked as
GenerateRayDirections( (__m256*)ray.x, (__m256*)ray.y, (__m256*)ray.z );
I am using Intel's Embree library, which has a vfloat8 class that internally stores its data as a union of __m256 and an array of 8 floats, so there is no casting at all, but there also seem to be no load calls. If I embed vfloat8 members instead, the call becomes:
GenerateRayDirections( &ray.x.v, &ray.y.v, &ray.z.v );
I am looking for some guidance on when to load and when to cast.
In practice, there should be no difference between a cast and a call to _mm256_load_ps as far as the generated assembly is concerned. As you point out, you can even get the desired result through a union.
All of them will generate load and store (vmov) instructions under the hood, however.
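As a rough illustration of the two styles, here is a minimal sketch that scales one 8-float member in place, once through a pointer cast and once through explicit load and store intrinsics. Ray8 mirrors the struct from the question, with alignas(32) standing in for the ALIGN(32) macro. ScaleByCast, ScaleByLoad, and the AVX_FN macro are hypothetical names made up for this example. The target attribute is a GCC/Clang convenience so the snippet builds without -mavx; other compilers would need their own AVX switch.

```cpp
#include <immintrin.h>

// Assumption: GCC or Clang. The attribute enables AVX for these functions only.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

struct alignas(32) Ray8
{
    float x[8];
    float y[8];
    float z[8];
};

// Style 1: reinterpret the float array as a __m256 and dereference it.
// The load and the store are implicit in the two uses of *v.
AVX_FN void ScaleByCast(float* p, float s)
{
    __m256* v = reinterpret_cast<__m256*>(p);
    *v = _mm256_mul_ps(*v, _mm256_set1_ps(s));
}

// Style 2: the same operation with the memory traffic spelled out.
AVX_FN void ScaleByLoad(float* p, float s)
{
    __m256 v = _mm256_load_ps(p);            // memory -> register
    v = _mm256_mul_ps(v, _mm256_set1_ps(s)); // register only
    _mm256_store_ps(p, v);                   // register -> memory
}
```

Both functions require p to be 32-byte aligned, which holds for every member of Ray8 because each array is 32 bytes long and the struct itself is 32-byte aligned.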
Why might you prefer to call _mm256_load_ps manually? Because it forces you to think about when the data gets moved from memory to a vector register. The downside of using casts and unions is that you may become unaware of the loads and stores. They come with significant latency penalties, much worse than what the high-level source code might indicate.
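Another benefit of using an intrinsic like _mm256_loadu_ps is that it permits unaligned memory accesses without crashing, whereas _mm256_load_ps and a dereferenced __m256 pointer both assume 32-byte alignment and can fault if the address is not aligned.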
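To make the unaligned case concrete, here is a minimal sketch that reads 8 floats from an arbitrary address with _mm256_loadu_ps. Sum8Unaligned and the AVX_FN macro are hypothetical names for this example, and the target attribute again assumes GCC or Clang.

```cpp
#include <immintrin.h>

// Assumption: GCC or Clang. The attribute enables AVX for this function only.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Sums 8 consecutive floats starting at p. p may have any alignment.
AVX_FN float Sum8Unaligned(const float* p)
{
    __m256 v = _mm256_loadu_ps(p);  // unaligned load, no alignment requirement

    // Spill to an aligned scratch buffer so the lanes can be added in scalar code.
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, v);

    float sum = 0.0f;
    for (int i = 0; i < 8; ++i)
        sum += lanes[i];
    return sum;
}
```

Calling this with a pointer such as data + 1, where data is a 32-byte aligned array, is fine here, while the same address passed to _mm256_load_ps would violate its alignment requirement.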