The short question is that if I have a function that takes two vectors. One is input and the other is output (no alias). I can only align one of them, which one should I choose?
The longer version is that, consider a function,
void func(size_t n, void *in, void *out)
{
__m256i *in256 = reinterpret_cast<__m256i *>(in);
__m256i *out256 = reinterpret_cast<__m256i *>(out);
while (n >= 32) {
__m256i data = _mm256_loadu_si256(in256++);
// process data
_mm256_storeu_si256(out256++, data);
n -= 32;
}
// process the remaining n % 32 bytes;
}
If in
and out
are both 32-bytes aligned, then there's no penalty of using vmovdqu
instead of vmovdqa
. The worst case scenario is that both are unaligned, and one in four load/store will cross the cache-line boundary.
In this case, I can align one of them to the cache line boundary by processing a few elements first before entering the loop. However, the question is which should I choose? Between unaligned load and store, which one is worse?
Risking to state the obvious here: There is no "right answer" except "you need to benchmark both with actual code and actual data". Whichever variant is faster strongly depends on the CPU you are using, the amount of calculations you are doing on each package and many other things.
As noted in the comments, you should also try non-temporal stores. What also sometimes can help is to load the input of the following data packet inside the current loop, i.e.:
__m256i next = _mm256_loadu_si256(in256++);
for(...){
__m256i data = next; // usually 0 cost
next = _mm256_loadu_si256(in256++);
// do computations and store data
}
If the calculations you are doing have unavoidable data latencies, you should also consider calculating two packages interleaved (this uses twice as many registers though).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With