
Why does Visual Studio increment the loop pointer before dereferencing it?

I checked out Visual Studio 2012's assembly output from the following SIMD code:

    float *end = arr + sz;
    float *b = other.arr;
    for (float *a = arr; a < end; a += 4, b += 4)
    {
        __m128 ax = _mm_load_ps(a);
        __m128 bx = _mm_load_ps(b);
        ax = _mm_add_ps(ax, bx);
        _mm_store_ps(a, ax);
    }

The loop body is:

    $LL11@main:
        movaps  xmm1, XMMWORD PTR [eax+ecx]
        addps   xmm1, XMMWORD PTR [ecx]
        add     ecx, 16                 ; 00000010H
        movaps  XMMWORD PTR [ecx-16], xmm1
        cmp     ecx, edx
        jb      SHORT $LL11@main

Why increment ecx by 16, only to subtract 16 again in the address of the store on the very next line?

japreiss asked Sep 11 '13 04:09



3 Answers

Well, there are basically two options here. Option 1:

    add ecx, 16
    movaps XMMWORD PTR [ecx-16], xmm1 ; stall for ecx?
    cmp ecx, edx
    jb loop

or option 2:

    movaps XMMWORD PTR [ecx], xmm1
    add ecx, 16
    cmp ecx, edx ; stall for ecx?
    jb loop

In option 1 there is a potential stall between add and movaps, because the store address depends on the new value of ecx. In option 2 there is a potential stall between add and cmp. However, there is also the question of which execution unit is used: add and cmp (= sub) use the ALU, while the address computation [ecx-16] uses the AGU (Address Generation Unit), I believe. So I suspect there might be a slight win in option 1, because ALU use is interleaved with AGU use.
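To make it concrete that the two orderings store to the same address, here is a minimal C++ sketch of both at source level. The function names and the `orderings_agree` check are hypothetical, made up for illustration, and the code assumes an x86 or x86-64 target with 16-byte-aligned arrays whose length is a multiple of 4. The compiler is free to emit either instruction order no matter which way the source is written, so this only shows the equivalence and says nothing about speed.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86 or x86-64 target

// Option 2 ordering: store through the pointer, then advance it.
void add_store_then_advance(float* a, const float* b, std::size_t n) {
    assert(n % 4 == 0);  // four floats are processed per iteration
    float* end = a + n;
    while (a < end) {
        __m128 sum = _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b));
        _mm_store_ps(a, sum);      // like: movaps [ecx], xmm1
        a += 4;                    // like: add ecx, 16
        b += 4;
    }
}

// Option 1 ordering: advance first, then store to the previous address.
void add_advance_then_store(float* a, const float* b, std::size_t n) {
    assert(n % 4 == 0);
    float* end = a + n;
    while (a < end) {
        __m128 sum = _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b));
        a += 4;                    // like: add ecx, 16
        b += 4;
        _mm_store_ps(a - 4, sum);  // like: movaps [ecx-16], xmm1
    }
}

// Hypothetical check: both orderings must leave identical arrays behind.
bool orderings_agree() {
    alignas(16) float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(16) float y[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(16) const float inc[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    add_store_then_advance(x, inc, 8);
    add_advance_then_store(y, inc, 8);
    for (std::size_t i = 0; i < 8; ++i) {
        if (x[i] != y[i] || x[i] != (i + 1) * 11.0f) return false;
    }
    return true;
}
```

Calling `orderings_agree()` from `main` should return true. The only difference between the two functions is whether the store sees the pointer before or after the increment, which is exactly the `[ecx]` versus `[ecx-16]` distinction above.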

Igor Skochinsky answered Nov 01 '22 23:11


ADDPS has a latency of 3 cycles, plus the memory load. The ADD that follows is much quicker, so it will have completed before the MOVAPS store, which has to wait for the ADDPS result in xmm1, can start.

Stefano Tommesani answered Nov 01 '22 23:11


Indeed this is a bit strange.

Many compilers avoid reading a register in the instruction immediately after the one that modifies it, because such code runs slower on some processors. Example:

    ; Code that runs fast:
    add ecx, 16
    mov esi, edi
    cmp ecx, edx

    ; Code doing the same that may run slower:
    mov esi, edi
    add ecx, 16
    cmp ecx, edx

For this reason, compilers often reorder assembler instructions. In your case, however, this is definitely not the explanation: the reordering puts the read of ecx in [ecx-16] directly after the add that modifies it.

Maybe the compiler's optimizer is not 100% correct and therefore performs this kind of "optimization".

Martin Rosenau answered Nov 01 '22 23:11