Unaligned load versus unaligned store

Question

The short question is that if I have a function that takes two vectors. One is input and the other is output (no alias). I can only align one of them, which one should I choose?

The longer version is that, consider a function,

void func(size_t n, void *in, void *out)
{
    __m256i *in256 = reinterpret_cast<__m256i *>(in);
    __m256i *out256 = reinterpret_cast<__m256i *>(out);
    while (n >= 32) {
         __m256i data = _mm256_loadu_si256(in256++);
         // process data
         _mm256_storeu_si256(out256++, data);
         n -= 32;
    }
    // process the remaining n % 32 bytes;
}

If in and out are both 32-bytes aligned, then there's no penalty of using vmovdqu instead of vmovdqa. The worst case scenario is that both are unaligned, and one in four load/store will cross the cache-line boundary.

In this case, I can align one of them to the cache line boundary by processing a few elements first before entering the loop. However, the question is which should I choose? Between unaligned load and store, which one is worse?

chtz · Accepted Answer

Risking to state the obvious here: There is no "right answer" except "you need to benchmark both with actual code and actual data". Whichever variant is faster strongly depends on the CPU you are using, the amount of calculations you are doing on each package and many other things.

As noted in the comments, you should also try non-temporal stores. What also sometimes can help is to load the input of the following data packet inside the current loop, i.e.:

__m256i next =  _mm256_loadu_si256(in256++);
for(...){
    __m256i data = next; // usually 0 cost
    next = _mm256_loadu_si256(in256++);
    // do computations and store data
}

If the calculations you are doing have unavoidable data latencies, you should also consider calculating two packages interleaved (this uses twice as many registers though).

Unaligned load versus unaligned store

Tags:

c++

performance

x86

avx

memory-alignment

Yan Zhou

1 Answers

chtz

Recent Activity

Donate For Us

Unaligned load versus unaligned store

Tags:

c++

performance

x86

avx

memory-alignment

Yan Zhou

1 Answers

chtz

Related questions

Recent Activity

Donate For Us