When the compiler reorders AVX instructions on Sandy Bridge, does it affect performance?

Please do not say this is premature micro-optimization. I want to understand, as much as possible given my limited knowledge, how the described Sandy Bridge feature and the assembly work, and make sure that my code makes use of this architectural feature. Thank you for understanding.

I started learning intrinsics only a few days ago, so the answer may seem obvious to some, but I don't have a reliable source of information to figure this out.

I need to optimize some code for a Sandy Bridge CPU (this is a requirement). I know that it can issue one AVX multiply and one AVX add per cycle, and I have read this paper:

http://research.colfaxinternational.com/file.axd?file=2012%2F7%2FColfax_CPI.pdf

which shows how this can be done in C++. The problem is that my code won't get auto-vectorized by Intel's compiler (which is another requirement for the task), so I decided to implement it manually using intrinsics, like this:

// Three independent accumulators (see the unrolling discussion below);
// sum is a scalar accumulator declared elsewhere in the task's code.
__m256d __sum1 = _mm256_setzero_pd();
__m256d __sum2 = _mm256_setzero_pd();
__m256d __sum3 = _mm256_setzero_pd();
__m256d __aa1, __bb1, __aa2, __bb2, __aa3, __bb3;
sum = 0;
for(kk = k; kk < k + BS && kk < aW; kk += 12)
{
    // _mm256_load_pd requires 32-byte-aligned addresses.
    const double *a_addr = &A[i * aW + kk];
    const double *b_addr = &newB[jj * aW + kk];
    __aa1 = _mm256_load_pd(a_addr);
    __bb1 = _mm256_load_pd(b_addr);
    __sum1 = _mm256_add_pd(__sum1, _mm256_mul_pd(__aa1, __bb1));

    __aa2 = _mm256_load_pd(a_addr + 4);
    __bb2 = _mm256_load_pd(b_addr + 4);
    __sum2 = _mm256_add_pd(__sum2, _mm256_mul_pd(__aa2, __bb2));

    __aa3 = _mm256_load_pd(a_addr + 8);
    __bb3 = _mm256_load_pd(b_addr + 8);
    __sum3 = _mm256_add_pd(__sum3, _mm256_mul_pd(__aa3, __bb3));
}
__sum1 = _mm256_add_pd(__sum1, _mm256_add_pd(__sum2, __sum3));
_mm256_store_pd(&vsum[0], __sum1);
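Note that the final _mm256_store_pd leaves four partial sums in vsum that still have to be reduced to a single scalar afterwards. A minimal sketch of that horizontal reduction (hsum_pd256 is my own helper name, not part of the task's code):

    #include <immintrin.h>

    // Reduce the four doubles in a __m256d to one scalar (needs only AVX + SSE2).
    static inline double hsum_pd256(__m256d v)
    {
        __m128d lo = _mm256_castpd256_pd128(v);    // lower two doubles
        __m128d hi = _mm256_extractf128_pd(v, 1);  // upper two doubles
        __m128d s  = _mm_add_pd(lo, hi);           // {v0+v2, v1+v3}
        __m128d h  = _mm_unpackhi_pd(s, s);        // move high element down
        return _mm_cvtsd_f64(_mm_add_sd(s, h));    // (v0+v2) + (v1+v3)
    }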

The reason I manually unroll the loop like this is explained here:

Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

They say you need to unroll by a factor of 3 to achieve the best performance on Sandy Bridge. My naive testing confirms that this indeed runs faster than the non-unrolled version or a 4-fold unroll.
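As I understand it, the factor of three comes from the 3-cycle latency of vaddpd on Sandy Bridge: with a single accumulator every add has to wait for the previous one, while three independent accumulators keep the add port busy each cycle. A minimal sketch of the contrast (my own illustration, assuming n is a multiple of 12 and 32-byte-aligned a and b):

    // One accumulator: each vaddpd depends on the previous one, so with a
    // 3-cycle add latency the add port does useful work one cycle in three.
    __m256d acc = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 4)
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_load_pd(a + i),
                                               _mm256_load_pd(b + i)));

    // Three accumulators: three independent dependency chains, so one add
    // can retire every cycle; combine the accumulators after the loop.
    __m256d acc0 = _mm256_setzero_pd(), acc1 = acc0, acc2 = acc0;
    for (int i = 0; i < n; i += 12) {
        acc0 = _mm256_add_pd(acc0, _mm256_mul_pd(_mm256_load_pd(a + i),
                                                 _mm256_load_pd(b + i)));
        acc1 = _mm256_add_pd(acc1, _mm256_mul_pd(_mm256_load_pd(a + i + 4),
                                                 _mm256_load_pd(b + i + 4)));
        acc2 = _mm256_add_pd(acc2, _mm256_mul_pd(_mm256_load_pd(a + i + 8),
                                                 _mm256_load_pd(b + i + 8)));
    }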

OK, so here is the problem. The icl compiler from Intel Parallel Studio 15 generates this:

    $LN149:
            movsxd    r14, r14d                                     ;78.49
    $LN150:
            vmovupd   ymm3, YMMWORD PTR [r11+r14*8]                 ;80.48
    $LN151:
            vmovupd   ymm5, YMMWORD PTR [32+r11+r14*8]              ;84.49
    $LN152:
            vmulpd    ymm4, ymm3, YMMWORD PTR [r8+r14*8]            ;82.56
    $LN153:
            vmovupd   ymm3, YMMWORD PTR [64+r11+r14*8]              ;88.49
    $LN154:
            vmulpd    ymm15, ymm5, YMMWORD PTR [32+r8+r14*8]        ;86.56
    $LN155:
            vaddpd    ymm2, ymm2, ymm4                              ;82.34
    $LN156:
            vmulpd    ymm4, ymm3, YMMWORD PTR [64+r8+r14*8]         ;90.56
    $LN157:
            vaddpd    ymm0, ymm0, ymm15                             ;86.34
    $LN158:
            vaddpd    ymm1, ymm1, ymm4                              ;90.34
    $LN159:
            add       r14d, 12                                      ;76.57
    $LN160:
            cmp       r14d, ebx                                     ;76.42
    $LN161:
            jb        .B1.19        ; Prob 82%                      ;76.42

To me, this looks like a mess, where the correct order (each add next to its multiply, which I thought was required to use the handy SB feature) is broken.

Questions:

  • Will this assembly code leverage the Sandy Bridge feature I am referring to?

  • If not, what do I need to do in order to utilize the feature and prevent the code from becoming "tangled" like this?

Also, when there is only one loop iteration, the order is nice and clean, i.e. load, multiply, add, as it should be.

asked Jan 04 '15 by iksemyonov

1 Answer

With x86 CPUs, many people expect to get the maximum FLOPS from the dot product

for(int i=0; i<n; i++) sum += a[i]*b[i];

but this turns out not to be the case.

What can give the maximum FLOPS is this

for(int i=0; i<n; i++) sum += k*a[i];

where k is a constant. Why is the CPU not optimized for the dot product? I can only speculate, but one of the things CPUs are optimized for is BLAS, and BLAS is considered a building block of many other routines.

The Level-1 and Level-2 BLAS routines become memory-bandwidth bound as n increases. It's only the Level-3 routines (e.g. matrix multiplication) that are capable of being compute bound. This is because the Level-3 computations go as n^3 while the reads go as n^2, so the CPU is optimized for the Level-3 routines. The Level-3 routines don't need to optimize for a single dot product; they only need to read from one matrix per iteration (sum += k*a[i]).

From this we can conclude that the number of bits that needs to be read each cycle to get the maximum FLOPS for the Level-3 routines is

read_size = SIMD_WIDTH * num_MAC

where num_MAC is the number of multiply–accumulate operations that can be done each cycle.

                 SIMD_WIDTH (bits)   num_MAC   read_size (bits)   load ports used
Nehalem                128              1            128          128 bits on port 2
Sandy Bridge           256              1            256          128 bits each on ports 2 and 3
Haswell                256              2            512          256 bits each on ports 2 and 3
Skylake                512              2           1024          ?

For Nehalem through Haswell this agrees with what the hardware is capable of: Sandy Bridge, for example, can load 2×128 = 256 bits per cycle through ports 2 and 3, which is exactly the read_size in the table, whereas a dot product would need two reads per multiply–accumulate, i.e. 512 bits per cycle. I don't actually know that Skylake will be able to read 1024 bits per clock cycle, but if it can't, AVX512 won't be very interesting, so I'm confident in my guess. A nice plot for Nehalem, Sandy Bridge, and Haswell for each port can be found at http://www.anandtech.com/show/6355/intels-haswell-architecture/8

So far I have ignored latency and dependency chains. To really get the maximum FLOPS you need to unroll the loop at least three times on Sandy Bridge (I use four, because I find it inconvenient to work with multiples of three).
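A sketch of what that unrolling looks like for the sum += k*a[i] kernel (AVX intrinsics with four accumulators; the function name and the assumption that n is a multiple of 16 are mine):

    #include <immintrin.h>

    // Four independent accumulators hide the 3-cycle vaddpd latency on
    // Sandy Bridge; n is assumed to be a multiple of 16 for brevity.
    double scaled_sum(const double *a, double k, int n)
    {
        __m256d vk = _mm256_set1_pd(k);
        __m256d s0 = _mm256_setzero_pd(), s1 = _mm256_setzero_pd();
        __m256d s2 = _mm256_setzero_pd(), s3 = _mm256_setzero_pd();
        for (int i = 0; i < n; i += 16) {
            s0 = _mm256_add_pd(s0, _mm256_mul_pd(vk, _mm256_loadu_pd(a + i)));
            s1 = _mm256_add_pd(s1, _mm256_mul_pd(vk, _mm256_loadu_pd(a + i + 4)));
            s2 = _mm256_add_pd(s2, _mm256_mul_pd(vk, _mm256_loadu_pd(a + i + 8)));
            s3 = _mm256_add_pd(s3, _mm256_mul_pd(vk, _mm256_loadu_pd(a + i + 12)));
        }
        __m256d s = _mm256_add_pd(_mm256_add_pd(s0, s1), _mm256_add_pd(s2, s3));
        double tmp[4];
        _mm256_storeu_pd(tmp, s);                   // spill the four lanes
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];   // scalar reduction
    }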

The best way to answer your question about performance is to find the theoretical best performance you expect for your operation and then compare how close your code gets to it. I call this the efficiency. Doing this, you will find that despite the reordering of the instructions you see in the assembly, the performance is still good: Sandy Bridge executes out of order, so the hardware scheduler, not the static order in the listing, decides when each multiply and add actually issues. But there are many other subtle issues you may need to consider. Here are three issues I encountered:

L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096

Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%

Difference in performance between MSVC and GCC for highly optimized matrix multiplication
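To make the efficiency concrete, here is a minimal timing sketch (my own illustration: the 3.3 GHz clock and the peak of 8 double-precision FLOPs per cycle per Sandy Bridge core, one 4-wide multiply plus one 4-wide add, are assumptions to replace with your own numbers, and dot is a hypothetical kernel under test):

    #include <chrono>
    #include <cstdio>

    // Hypothetical kernel under test: a dot product does 2*n FLOPs.
    double dot(const double *a, const double *b, int n);

    void report_efficiency(const double *a, const double *b, int n)
    {
        const double ghz  = 3.3;           // assumed core clock, adjust to yours
        const double peak = ghz * 8.0;     // GFLOPS: 4-wide mul + 4-wide add per cycle
        auto t0 = std::chrono::high_resolution_clock::now();
        volatile double r = dot(a, b, n);  // volatile keeps the call from being elided
        auto t1 = std::chrono::high_resolution_clock::now();
        double sec    = std::chrono::duration<double>(t1 - t0).count();
        double gflops = 2.0 * n / sec * 1e-9;
        std::printf("%6.2f GFLOPS, efficiency %5.1f%%\n",
                    gflops, 100.0 * gflops / peak);
        (void)r;
    }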

I also suggest you consider using IACA (the Intel Architecture Code Analyzer) to study the performance.
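If memory serves, the IACA workflow is to mark the region with the macros from iacaMarks.h and then run the analyzer on the compiled object file; the exact command-line flags below are from memory, so check the documentation of your IACA version:

    #include "iacaMarks.h"   // header shipped with IACA

    for (kk = k; kk < k + BS && kk < aW; kk += 12) {
        IACA_START           // start of the analyzed region, at the top of the body
        /* ... loop body from the question ... */
    }
    IACA_END                 // end of the analyzed region, just after the loop

    /* Then analyze the object file, e.g. (IACA 2.x syntax, from memory):
           iaca -64 -arch SNB myfile.obj                                   */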

answered by Z boson