Unfortunately I have an AMD Piledriver CPU, which seems to have problems with AVX instructions:
Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5 - 6 times slower than on the previous model (Bulldozer), and 8 - 9 times slower than two 128-bit writes.
In my own experience, I've found _mm256 intrinsics to be much slower than _mm128, and I'm assuming that's because of the above. I still want to code for the newest instruction set, AVX, while being able to test builds on my machine at a reasonable speed. Is there a way to force _mm256 intrinsics to use SSE instructions instead? I'm using VS 2015.
If there is no easy way, what about a hard way: replace <immintrin.h> with a custom-made header containing my own definitions for the intrinsics, coded to use SSE? I'm not sure how plausible that is, and I'd prefer the easier way, if one exists, before I go through that work.
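To give an idea of what I mean, such a header would presumably have to define a two-register stand-in for __m256 and re-implement each intrinsic on top of it, roughly like this (a hypothetical, untested sketch covering only two intrinsics):
#include <xmmintrin.h>               // SSE only

struct my_m256 { __m128 lo, hi; };   // hand-rolled stand-in for __m256

static inline my_m256 my_mm256_add_ps(my_m256 a, my_m256 b) {
    my_m256 r;
    r.lo = _mm_add_ps(a.lo, b.lo);   // each 256-bit op becomes two 128-bit ops
    r.hi = _mm_add_ps(a.hi, b.hi);
    return r;
}

static inline void my_mm256_storeu_ps(float *p, my_m256 v) {
    _mm_storeu_ps(p,     v.lo);      // two 128-bit stores instead of one 256-bit store
    _mm_storeu_ps(p + 4, v.hi);
}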
Use Agner Fog's Vector Class Library and add this to the command line in Visual Studio: -D__SSE4_2__ -D__XOP__.
Then use an AVX-sized vector such as Vec8f for eight floats. When you compile without AVX enabled, the VCL will use the file vectorf256e.h, which emulates AVX with two SSE registers. For example, Vec8f inherits from Vec256fe, which starts like this:
class Vec256fe {
protected:
    __m128 y0;  // low half
    __m128 y1;  // high half
If you compile with /arch:AVX -D__XOP__, the VCL will instead use the file vectorf256.h and a single AVX register. Your code then works for both AVX and SSE with only a compiler-switch change. If you don't want to use XOP, don't define -D__XOP__.
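As a minimal sketch of what using the VCL looks like (the function and array names here are made up for illustration, and the loop ignores any remainder elements):
#include "vectorclass.h"

// Add two float arrays eight elements at a time. With /arch:AVX this
// compiles to 256-bit AVX; without it, the emulated Vec8f from
// vectorf256e.h issues pairs of 128-bit SSE instructions instead.
void add_arrays(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i + 8 <= n; i += 8) {
        Vec8f va, vb;
        va.load(a + i);              // unaligned load of 8 floats
        vb.load(b + i);
        (va + vb).store(dst + i);    // unaligned store of 8 floats
    }
}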
As Peter Cordes pointed out in his answer, if your goal is only to avoid 256-bit loads/stores, then you may still want VEX-encoded instructions (though it's not clear this will make a difference except in some special cases). You can do that with the vector class like this:
Vec8f a;
Vec4f lo = a.get_low(); // a is a Vec8f type
Vec4f hi = a.get_high();
lo.store(&b[0]); // b is a float array
hi.store(&b[4]);
then compile with /arch:AVX -D__XOP__.
Another option would be one source file that uses Vecnf and then does
//foo.cpp
#include "vectorclass.h"
#if SIMDWIDTH == 4
typedef Vec4f Vecnf;
#else
typedef Vec8f Vecnf;
#endif
and compile like this
cl /O2 /DSIMDWIDTH=4 foo.cpp /Fofoo_sse
cl /O2 /DSIMDWIDTH=4 /arch:AVX /D__XOP__ foo.cpp /Fofoo_avx128
cl /O2 /DSIMDWIDTH=8 /arch:AVX foo.cpp /Fofoo_avx256
This would create three executables from one source file. Instead of linking them, you could just compile them with /c and then make a CPU dispatcher. I used XOP with avx128 because I don't think there is a good reason to use avx128 except on AMD.
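A minimal dispatcher could look like this. It's only a sketch: the kernel names foo_sse, foo_avx128 and foo_avx256 are made up and assume each /c-compiled object was given its own entry-point name (e.g. via a preprocessor define); the feature tests use the standard CPUID/XGETBV bits.
#include <intrin.h>      // __cpuid (MSVC)
#include <immintrin.h>   // _xgetbv (MSVC)

// Hypothetical kernels, one per object compiled with /c.
void foo_sse   (float *dst, const float *a, const float *b, int n);
void foo_avx128(float *dst, const float *a, const float *b, int n);
void foo_avx256(float *dst, const float *a, const float *b, int n);

static bool os_supports_avx() {
    int r[4];
    __cpuid(r, 1);
    bool osxsave = (r[2] & (1 << 27)) != 0;    // OS has enabled XSAVE
    bool avx     = (r[2] & (1 << 28)) != 0;    // CPU reports AVX
    if (!osxsave || !avx) return false;
    return (_xgetbv(0) & 0x6) == 0x6;          // OS saves XMM and YMM state
}

static bool cpu_has_xop() {                    // XOP => AVX-capable Bulldozer-family AMD
    int r[4];
    __cpuid(r, 0x80000001);
    return (r[2] & (1 << 11)) != 0;
}

void foo(float *dst, const float *a, const float *b, int n) {
    if (!os_supports_avx())  foo_sse(dst, a, b, n);
    else if (cpu_has_xop())  foo_avx128(dst, a, b, n);  // avoid 256-bit stores on these CPUs
    else                     foo_avx256(dst, a, b, n);
}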
You don't want to use SSE instructions. What you want is for 256b stores to be done as two separate 128b stores, still with VEX-coded 128b instructions, i.e. 128b AVX vmovups.
gcc has -mavx256-split-unaligned-load and -mavx256-split-unaligned-store options (enabled as part of -march=sandybridge, for example, and presumably also for Bulldozer-family; -march=bdver2 is Piledriver). That doesn't solve the problem when the compiler knows the memory is aligned, though.
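For example (these options exist in gcc; the file name is just a placeholder):
gcc -O2 -mavx -mavx256-split-unaligned-load -mavx256-split-unaligned-store foo.c
gcc -O2 -march=bdver2 foo.c   # targets Piledriver; presumably also turns the splitting on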
You could override the normal 256b store intrinsic with a macro like
// maybe enable this for all BD family CPUs?
#if defined(__bdver2) || defined(PILEDRIVER) || defined(SPLIT_256b_STORES)
#define _mm256_storeu_ps(addr, data) do{ \
    _mm_storeu_ps( ((float*)(addr)) + 0, _mm256_extractf128_ps((data), 0)); \
    _mm_storeu_ps( ((float*)(addr)) + 4, _mm256_extractf128_ps((data), 1)); \
}while(0)
#endif
gcc defines __bdver2 (Bulldozer version 2) for Piledriver (-march=bdver2).
You could do the same for the (aligned) _mm256_store_ps, or just always use the unaligned intrinsic.
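If you do want the aligned variant, a sketch mirroring the macro above:
#if defined(__bdver2) || defined(PILEDRIVER) || defined(SPLIT_256b_STORES)
#define _mm256_store_ps(addr, data) do{ \
    _mm_store_ps( ((float*)(addr)) + 0, _mm256_extractf128_ps((data), 0)); \
    _mm_store_ps( ((float*)(addr)) + 4, _mm256_extractf128_ps((data), 1)); \
}while(0)
#endif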
Compilers optimize _mm256_extractf128_ps((data), 0) to a simple cast, so this should just compile to
vmovups      [rdi], xmm0        ; if data is in ymm0 (low half xmm0) and addr is in rdi
vextractf128 [rdi+16], ymm0, 1
However, testing on Godbolt shows that gcc and clang are dumb, and extract to a register and then store. ICC correctly generates the two-instruction sequence.