What is the reason for inner loop performance degradation after upgrade?

Question

I have a hand-rolled matrix algorithm which finds the largest number of a right lower square of a square matrix (thus when iterating, some parts are 'jumped' over) - stored as a dense matrix. After an update from vs2010 to vs2017 it seems to be much slower - about a 50% slowdown overall. After some investigation, this was located to the inner loop of a function finding absolute largest value. Looking at the assembler output, this seems be due to some extra mov instructions being inserted within the tight loop. Reworking the loop in different ways seems to solve or partly solve the issue. gcc doesn't seem to have this "issue" in comparison.

Simplified examples (fabs not always neccessary to reproduce):

#include <cmath>
#include <iostream>

int f_slow(double *A, size_t from, size_t w)
{
    double biga_absval = *A;
    size_t ir = 0,ic=0;
    for ( size_t j = 0; j < w; j++ ) {
      size_t n = j*w;
      for ( ; n < j*w+w; n++ ) {
        if ( fabs(A[n]) <= biga_absval ) {
          biga_absval = fabs( A[n] );
          ir   = j;
          ic   = n;
        }
        n++;
      }
    }

    std::cout << ir <<ic;
    return 0;
}

int f_fast(double *A, size_t from, size_t w)
{
    double* biga = A;
    double biga_absval = *biga;

    double* n_begin = A + from;
    double* n_end = A + w;
    for (double* A_n = n_begin; A_n < n_end; ++A_n) {
      if (fabs(*A_n) > biga_absval) {
        biga_absval = fabs(*A_n);
        biga = A_n;
      }
    }

    std::cout << biga;
    return 0;
}

int f_faster(double *A, size_t from, size_t w)
{
    double biga_absval = *A;
    size_t ir = 0,ic=0;
    for ( size_t j = 0; j < w; j++ ) {
      size_t n = j;
      for ( ; n < j*w+w; n++ ) {
        if ( fabs(A[n]) > biga_absval ) {
          biga_absval = fabs( A[n] );
          ir   = j;
          ic   = n - j*w;
        }
        n++;
      }
    }

    std::cout << ir <<ic;
    return 0;
}

Please note: examples were created to look at output only (and indexes etc. don't neccessarily make sense):

https://godbolt.org/z/q9rWwi

So my question is: is this just a (known?) optimizer bug (?) or is there some logic behind what in this case seems like a clear optimization miss ?

Using latest stable vs2017 15.9.5

Update: The extra movs I see is before the jump codes - easiest way to find in compiler explorer is to right click on the if and then "scroll to".

malik · Accepted Answer

Well, I don't know why VC gets worse in your case, but I would like to offer some hint how to safe some ops.

void f_faster( const double* A, const std::size_t w ) {
    double      biga_absval = A[ 0 ];
    std::size_t ir, ic_n;
    for ( std::size_t j = 0; j < w; ++j ) {
        const auto N = j * w + w;
        for ( std::size_t n = j; n < N; n += 2 ) {
            if ( const auto new_big_a = std::fabs( A[ n ] ); new_big_a > biga_absval ) {
                biga_absval = new_big_a;
                ir          = j;
                ic_n        = n;
            }
        }
    }

    std::cout << ir << ( ic_n - ir * w );
}

don't calculate ic in the inner loop, just store n for later use
use const to help the optimizer
don't evaluate std::fabs twice
post-increment creates a copy, that you don't need (probably optimized away)
store the loop's upper bound outside, otherwise it might be re-evaluated (probably optimized away)
just increment n by two, instead of two increments by one
don't initialize with unused values

Maybe that's already enough to get rid of the extra mov?

What is the reason for inner loop performance degradation after upgrade?

Tags:

c++

compiler-optimization

assembly

visual-c++

visual-studio-2017

darune

1 Answers

malik

Recent Activity

Donate For Us

What is the reason for inner loop performance degradation after upgrade?

Tags:

c++

compiler-optimization

assembly

visual-c++

visual-studio-2017

darune

1 Answers

malik

Related questions

Recent Activity

Donate For Us