Warning: this is actually not due to the size being a power of 2 but to its parity. See the EDIT section.
I have found some code that shows quite strange behaviour.
The code uses a 2D array (size x size). When size is a power of 2, the code runs between 10% and 40% slower, the worst case being size = 32.
I obtained these results with the Intel compiler. If I compile with gcc 5.4, the power-of-2 problem disappears, but the code is 3x slower. Having tested it on different CPUs, I think it should be reasonably reproducible.
Code:
#define N 10000000
unsigned int tab1[TS][TS];

void simul1() {
    for(int i=0; i<TS; i++)
        for(int j=0; j<TS; j++) {
            if(i > 0)
                tab1[i][j] += tab1[i-1][j];
            if(j > 0)
                tab1[i][j] += tab1[i][j-1];
        }
}

int main() {
    for(int i=0; i<TS; i++)
        for(int j=0; j<TS; j++)
            tab1[j][i] = 0;
    for(int i=0; i<N; i++) {
        tab1[0][0] = 1;
        simul1();
    }
    return tab1[10][10];
}
Compilation:
icc:
icc main.c -O3 -std=c99 -Wall -DTS=31 -o i31
icc main.c -O3 -std=c99 -Wall -DTS=32 -o i32
icc main.c -O3 -std=c99 -Wall -DTS=33 -o i33
gcc:
gcc main.c -O3 -std=c99 -Wall -DTS=31 -o g31
gcc main.c -O3 -std=c99 -Wall -DTS=32 -o g32
gcc main.c -O3 -std=c99 -Wall -DTS=33 -o g33
Results on a Xeon E5:
time ./i31
4.549s
time ./i32
6.557s
time ./i33
5.188s
time ./g31
13.335s
time ./g32
13.669s
time ./g33
14.399s
My questions:

Why is the code slower when the size is a power of 2?
Why is icc 3x faster than gcc here?

EDIT: Actually this is due to parity, not to powers of 2, and it only applies from size >= 32 onwards. The performance difference between even and odd sizes is consistent, and it decreases as the size gets bigger. Here is a more detailed benchmark:
[benchmark chart] (The y-axis is the number of elements processed per second, in millions, computed as TS² * N / time / 1000000.)
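For example, the icc run with TS = 31 above (4.549 s) corresponds to roughly 31² * 10⁷ / 4.549 / 10⁶ ≈ 2100 million elements per second.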
My CPU has a 32 KB L1 cache and a 256 KB L2 cache.
Why is icc 3x faster than gcc here?
GCC is not able to vectorize the inner loop, as it reports that there are dependencies between data-refs. Intel's compiler is smart enough to split the inner loop into two independent parts:
for (int j = 1; j < TS; j++)
    tab1[i][j] += tab1[i-1][j]; // this one is vectorized
for (int j = 1; j < TS; j++)
    tab1[i][j] += tab1[i][j-1];
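(If you want to see that report, GCC's missed-vectorization diagnostics can be printed with -fopt-info-vec-missed, for example:

    gcc main.c -O3 -std=c99 -Wall -DTS=32 -fopt-info-vec-missed -o g32

which should show the data-dependence message for the original inner loop.)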
You might get better performance in GCC by rewriting simul1 into:
void simul1(void)
{
    for (int j = 1; j < TS; j++)
        tab1[0][j] += tab1[0][j-1];
    for (int i = 1; i < TS; i++) {
        for (int j = 0; j < TS; j++)
            tab1[i][j] += tab1[i-1][j];
        for (int j = 1; j < TS; j++)
            tab1[i][j] += tab1[i][j-1];
    }
}
My results with GCC 6.3.0, -O3 -march=native, TS = 32, on an Intel Core i5-5200U are:
Original version:
real 0m21.110s
user 0m21.004s
sys 0m0.000s
Modified:
real 0m8.588s
user 0m8.536s
sys 0m0.000s
After some research, I found that it is possible to vectorize the second inner loop using vector additions and shifts. The algorithm is presented here.
#include "emmintrin.h"
void simul1(void)
{
for (int j = 1; j < TS; j++)
tab1[0][j] += tab1[0][j-1];
for (int i = 1; i < TS; i++) {
for (int j = 0; j < TS; j++)
tab1[i][j] += tab1[i-1][j];
for (int stride = 0; stride < TS; stride += 4) {
__m128i v;
v = _mm_loadu_si128((__m128i*) (tab1[i] + stride));
v = _mm_add_epi32(v, _mm_slli_si128(v, sizeof(int)));
v = _mm_add_epi32(v, _mm_slli_si128(v, 2*sizeof(int)));
_mm_storeu_si128((__m128i*) (tab1[i] + stride), v);
}
for (int stride = 4; stride < TS; stride += 4)
for (int j = 0; j < 4; j++)
tab1[i][stride+j] += tab1[i][stride-1];
}
}
Result:
real 0m7.541s
user 0m7.496s
sys 0m0.004s
This one is more complicated. Consider an eight-element vector of ints:
V = (a, b, c, d, e, f, g, h)
We can treat it as two packed vectors:
(a, b, c, d), (e, f, g, h)
First, the algorithm performs two independent prefix summations:
(a, b, c, d), (e, f, g, h)
+
(0, a, b, c), (0, e, f, g)
=
(a, a+b, b+c, c+d), (e, e+f, f+g, g+h)
+
(0, 0, a, a+b), (0, 0, e, e+f)
=
(a, a+b, a+b+c, a+b+c+d), (e, e+f, e+f+g, e+f+g+h)
Then it propagates the last element of the first vector into each element of the second vector, so it finally yields:
(a, a+b, a+b+c, a+b+c+d), (a+b+c+d+e, a+b+c+d+e+f, a+b+c+d+e+f+g, a+b+c+d+e+f+g+h)
I suspect that these intrinsics could be written better, so there is potential for some improvement. (Note that the 256-bit integer intrinsics used here require AVX2.)
#include "immintrin.h"
void simul1(void)
{
for (int j = 1; j < TS; j++)
tab1[0][j] += tab1[0][j-1];
for (int i = 1; i < TS; i++) {
for (int j = 0; j < TS; j++)
tab1[i][j] += tab1[i-1][j];
for (int stride = 0; stride < TS; stride += 8) {
__m256i v;
v = _mm256_loadu_si256((__m256i*) (tab1[i] + stride));
v = _mm256_add_epi32(v, _mm256_slli_si256(v, sizeof(int)));
v = _mm256_add_epi32(v, _mm256_slli_si256(v, 2*sizeof(int)));
__m256i t = _mm256_setzero_si256();
t = _mm256_insertf128_si256(t,
_mm_shuffle_epi32(_mm256_castsi256_si128(v), 0xFF), 1);
v = _mm256_add_epi32(v, t);
_mm256_storeu_si256((__m256i*) (tab1[i] + stride), v);
}
for (int stride = 8; stride < TS; stride += 8)
for (int j = 0; j < 8; j++)
tab1[i][stride+j] += tab1[i][stride-1];
}
}
Result (Clang 3.8):
real 0m5.644s
user 0m5.364s
sys 0m0.004s
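As an example of the "could be written better" remark above, the setzero/insertf128/shuffle sequence that broadcasts a+b+c+d into the upper half could possibly be replaced with a shorter, untested AVX2 variant:

    /* untested sketch: move the low 128-bit lane of v into the high lane (low lane zeroed),
       then broadcast element 3 within each 128-bit lane, giving (0,0,0,0, s,s,s,s)
       where s = a+b+c+d */
    __m256i t = _mm256_shuffle_epi32(_mm256_permute2x128_si256(v, v, 0x08), 0xFF);
    v = _mm256_add_epi32(v, t);

Whether this is actually faster would need to be measured.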
Looks like a classic case of cache contention. Your code is written so that it operates on adjacent matrix rows and columns. This can be painful when the matrix rows align with cache lines, so that data from adjacent rows ends up stored in the same cache lines.
But there isn't a lot of data. If a line gets kicked out of the fast L1 cache, it will probably still fit in the fairly fast L2 cache. L2 is apparently fast enough for the code that GCC emits, but it can't keep up with the vectorized code from ICC.
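A simple (hypothetical) way to test this is to pad each row by one element, which changes the distance between consecutive rows in memory while leaving the loops untouched:

    /* hypothetical padding experiment: the loops still only touch columns 0..TS-1,
       but consecutive rows are now TS+1 ints apart in memory */
    #define PAD 1
    unsigned int tab1[TS][TS + PAD];

If the row stride and its interaction with the cache really is the cause, the gap between even and odd sizes should change noticeably with the padded layout.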