I have a struct consisting of seven 256-bit vector values (six __m256 and one __m256i), which is stored 32-byte aligned in memory.
typedef struct
{
__m256 xl,xh;
__m256 yl,yh;
__m256 zl,zh;
__m256i co;
} bloxset8_t;
I achieve the 32-byte alignment by using posix_memalign() for dynamically allocated data, or the __attribute__((aligned(32))) attribute for statically allocated data.
The alignment is fine, but when I pass two pointers to such structs as the destination and source of memcpy(), the compiler decides to use __memcpy_avx_unaligned() to copy.
How can I force clang to use the aligned avx memcpy function instead, which I assume is the faster variant?
OS: Ubuntu 16.04.3 LTS, Clang: 3.8.0-2ubuntu4.
UPDATE
__memcpy_avx_unaligned() is invoked only when copying two or more structs. When copying just one, clang emits 14 vmovups instructions.
__memcpy_avx_unaligned is just an internal glibc function name. It does not mean that there is a faster __memcpy_avx_aligned function. The name only conveys a hint to the glibc developers about how this memcpy variant is implemented.
The other question is whether it would be faster for the C compiler to emit an inline expansion of memcpy, which for one 224-byte struct means seven AVX loads and seven AVX stores. The code for that would be larger than the memcpy call, but it might still be faster overall. It may be possible to help the compiler do this using the __builtin_assume_aligned builtin.
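A minimal sketch of that idea follows. It is an illustration under stated assumptions, not a guaranteed fix: blox_standin_t is a hypothetical stand-in with the same size (224 bytes) and alignment as bloxset8_t, built from plain arrays so that it compiles without AVX flags, and copy_blox and alloc_blox are names made up for this example. Whether the compiler actually inlines the copy or picks aligned moves depends on the compiler version and optimization flags, so check the generated assembly.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for bloxset8_t: same 224-byte size and 32-byte
   alignment, but made of plain arrays instead of __m256/__m256i. */
typedef struct
{
    float   xl[8], xh[8];
    float   yl[8], yh[8];
    float   zl[8], zh[8];
    int32_t co[8];
} __attribute__((aligned(32))) blox_standin_t;

/* Copy n structs while promising the compiler that both pointers are
   32-byte aligned. Breaking that promise is undefined behaviour, so the
   asserts check it in debug builds. */
static void copy_blox(blox_standin_t *dst, const blox_standin_t *src, size_t n)
{
    assert(((uintptr_t)dst % 32) == 0);
    assert(((uintptr_t)src % 32) == 0);

    blox_standin_t *d = __builtin_assume_aligned(dst, 32);
    const blox_standin_t *s = __builtin_assume_aligned(src, 32);
    memcpy(d, s, n * sizeof *d);
}

/* Allocate n structs on a 32-byte boundary; returns NULL on failure.
   Release the result with free(). */
static blox_standin_t *alloc_blox(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 32, n * sizeof(blox_standin_t)) != 0)
        return NULL;
    return p;
}
```

With a compile-time constant n the compiler has the best chance of expanding the copy inline. For a run-time n it will most likely still call memcpy, and glibc will still choose its own variant at run time.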