I've written vectorized versions of some functions that are currently the bottleneck of an algorithm, using Eigen's facilities to do so.
I've also checked that AVX is enabled by making sure that EIGEN_VECTORIZE_AVX is defined after including Eigen.
However, it seems that my function never gets called with Packet8f (AVX) if the data size is not a multiple of 8. Instead, it gets called with Packet4f (SSE).
Here is a small repro: https://gist.github.com/bitonic/e89561cb21837b4dee8b5f49e1303919 . It defines an operation using Packet4f and Packet8f, and then counts how many times each gets called with arrays of size 8 and 9. When the array is of size 8, the Packet8f version gets called once, as expected. When it's of size 9, the Packet4f version gets called twice instead, plus a single call to the non-vectorized version. I've tested this code on Eigen's current master, 1d0c45122a5c4c5c1c4309f904120e551bacad02.
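To make the call counts above concrete, here is a minimal sketch of the arithmetic, assuming a plain "full packets first, scalar tail afterwards" loop with no alignment peeling. CallCounts and count_calls are hypothetical names used only for illustration; they are not part of Eigen, and Eigen's real evaluation loops may handle alignment differently.

```cpp
#include <cstddef>

// Hypothetical helper types, not part of Eigen.
struct CallCounts {
    std::size_t packet_calls;  // calls to the packet (SIMD) version of the op
    std::size_t scalar_calls;  // calls to the non-vectorized version
};

// Assumes the first n / width * width elements go through packets of `width`
// lanes and the leftover n % width elements go through the scalar path.
constexpr CallCounts count_calls(std::size_t n, std::size_t width) {
    return CallCounts{n / width, n % width};
}

// Size 8 with 8 lanes (Packet8f): one packet call, no scalar call.
static_assert(count_calls(8, 8).packet_calls == 1, "size 8, 8 lanes");
static_assert(count_calls(8, 8).scalar_calls == 0, "size 8, 8 lanes");

// Size 9 with 4 lanes (Packet4f): two packet calls plus one scalar call,
// which is what the repro observes.
static_assert(count_calls(9, 4).packet_calls == 2, "size 9, 4 lanes");
static_assert(count_calls(9, 4).scalar_calls == 1, "size 9, 4 lanes");

// Size 9 with 8 lanes (Packet8f): one packet call plus one scalar call,
// which is what I expected to see.
static_assert(count_calls(9, 8).packet_calls == 1, "size 9, 8 lanes");
static_assert(count_calls(9, 8).scalar_calls == 1, "size 9, 8 lanes");
```

Under this simplified model, the observed counts for size 9 match a 4-lane packet, while the counts I expected correspond to an 8-lane packet.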
I've dug a bit and I believe that packet selection is happening here: https://gitlab.com/libeigen/eigen/blob/1d0c45122a5c4c5c1c4309f904120e551bacad02/Eigen/src/Core/util/XprHelper.h#L197 .
If I understand correctly, if the size of the data is not dynamic and not a multiple of 8 (that's the value of unpacket_traits<Packet8f>::size), the half-packet will be selected, which matches what the reproduction above shows.
If my understanding is correct, why is that the case? Shouldn't the full packet be selected, with the remaining elements working with the non-vectorized operation?
Could it be that that condition is wrong, and should be a >= comparison instead, e.g. something like

    template<int Size, typename PacketType,
             bool Stop = Size==Dynamic || Size >= unpacket_traits<PacketType>::size || is_same<PacketType,typename unpacket_traits<PacketType>::half>::value>
    struct find_best_packet_helper;

instead of

    template<int Size, typename PacketType,
             bool Stop = Size==Dynamic || (Size%unpacket_traits<PacketType>::size)==0 || is_same<PacketType,typename unpacket_traits<PacketType>::half>::value>
    struct find_best_packet_helper;

I've verified that with the fix above the problem goes away.
However, I might be misunderstanding what is going on here, since I'm not very well versed in Eigen internals.
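For anyone who wants to experiment with the two stop conditions without building Eigen, here is a standalone sketch that models the recursion. Everything in it is a stand-in: MockPacket4f, MockPacket8f, mock_unpacket_traits and mock_find_best_packet are hypothetical names that only imitate the shape of the Eigen code quoted above, and the UseModuloRule switch is something I added to compare the two conditions side by side.

```cpp
#include <type_traits>

// Stand-in packet types (hypothetical; Eigen's real packets wrap SIMD registers).
struct MockPacket4f {};
struct MockPacket8f {};

const int Dynamic = -1;  // assumption: mirrors Eigen's Dynamic == -1 convention

template <typename Packet> struct mock_unpacket_traits;
template <> struct mock_unpacket_traits<MockPacket8f> {
    static const int size = 8;
    typedef MockPacket4f half;
};
template <> struct mock_unpacket_traits<MockPacket4f> {
    static const int size = 4;
    typedef MockPacket4f half;  // smallest packet: its half is itself
};

// UseModuloRule == true  models the (Size % size) == 0 condition,
// UseModuloRule == false models the proposed Size >= size condition.
template <bool UseModuloRule, int Size, typename Packet,
          bool Stop = Size == Dynamic
                   || (UseModuloRule ? (Size % mock_unpacket_traits<Packet>::size) == 0
                                     : Size >= mock_unpacket_traits<Packet>::size)
                   || std::is_same<Packet, typename mock_unpacket_traits<Packet>::half>::value>
struct mock_find_best_packet;

// Stop: keep the current packet type.
template <bool UseModuloRule, int Size, typename Packet>
struct mock_find_best_packet<UseModuloRule, Size, Packet, true> {
    typedef Packet type;
};

// Do not stop: retry with the half packet.
template <bool UseModuloRule, int Size, typename Packet>
struct mock_find_best_packet<UseModuloRule, Size, Packet, false> {
    typedef typename mock_find_best_packet<
        UseModuloRule, Size, typename mock_unpacket_traits<Packet>::half>::type type;
};

// Modulo rule: size 8 keeps the 8-lane packet, size 9 falls back to 4 lanes.
static_assert(std::is_same<mock_find_best_packet<true, 8, MockPacket8f>::type,
                           MockPacket8f>::value, "modulo rule, size 8");
static_assert(std::is_same<mock_find_best_packet<true, 9, MockPacket8f>::type,
                           MockPacket4f>::value, "modulo rule, size 9");

// >= rule: size 9 keeps the 8-lane packet, size 3 still falls back to 4 lanes.
static_assert(std::is_same<mock_find_best_packet<false, 9, MockPacket8f>::type,
                           MockPacket8f>::value, ">= rule, size 9");
static_assert(std::is_same<mock_find_best_packet<false, 3, MockPacket8f>::type,
                           MockPacket4f>::value, ">= rule, size 3");
```

In this model the modulo rule reproduces the fallback to the 4-lane packet for size 9, and the >= rule keeps the 8-lane packet. It says nothing about whether the >= rule is safe for every expression in Eigen itself.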
I have confirmed that this is due to how Eigen selects the packet type; see https://gitlab.com/libeigen/eigen/merge_requests/46 for a fix.