Warning: this is actually not due to the size being a power of 2 but to its parity. See the EDIT section.
I have found some code that shows quite strange behaviour.
The code uses a 2D array (size x size). When size is a power of 2, the code runs between 10% and 40% slower, the worst case being size = 32.
I obtained these results with the Intel compiler. If I compile with gcc 5.4, the power-of-2 problem disappears, but the code is 3x slower. Having tested it on different CPUs, I think it should be reasonably reproducible.
Code:
#define N 10000000
unsigned int tab1[TS][TS];

void simul1() {
    for(int i=0; i<TS; i++)
        for(int j=0; j<TS; j++) {
            if(i > 0)
                tab1[i][j] += tab1[i-1][j];
            if(j > 0)
                tab1[i][j] += tab1[i][j-1];
        }
}

int main() {
    for(int i=0; i<TS; i++)
        for(int j=0; j<TS; j++)
            tab1[j][i] = 0;
    for(int i=0; i<N; i++) {
        tab1[0][0] = 1;
        simul1();
    }
    return tab1[10][10];
}
Compilation:
icc:
icc main.c -O3 -std=c99 -Wall -DTS=31 -o i31
icc main.c -O3 -std=c99 -Wall -DTS=32 -o i32
icc main.c -O3 -std=c99 -Wall -DTS=33 -o i33
gcc:
gcc main.c -O3 -std=c99 -Wall -DTS=31 -o g31
gcc main.c -O3 -std=c99 -Wall -DTS=32 -o g32
gcc main.c -O3 -std=c99 -Wall -DTS=33 -o g33
Results on a Xeon E5:
time ./i31
4.549s
time ./i32
6.557s
time ./i33
5.188s
time ./g31
13.335s
time ./g32
13.669s
time ./g33
14.399s
My questions:

Why is the code slower when the size is a power of 2?
Why is icc 3x faster than gcc here?

EDIT: Actually this is due to parity, not to powers of 2, and it only applies from size >= 32 onwards. The performance difference between even and odd sizes is consistent, and it decreases as the size gets bigger. Here is a more detailed benchmark:
[benchmark chart] (The y-axis is the number of elements processed per second, in millions, computed as TS² * N / time / 1000000.)
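For example, the icc run with TS = 31 above (4.549 s) corresponds to roughly 31² * 10⁷ / 4.549 / 10⁶ ≈ 2100 million elements per second.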
My CPU has a 32 KB L1 cache and a 256 KB L2 cache.
Why is icc 3x faster than gcc here?
GCC is not able to vectorize the inner loop, as it reports that there are dependencies between data-refs. Intel's compiler is smart enough to split the inner loop into two independent parts:
for (int j = 1; j < TS; j++)
    tab1[i][j] += tab1[i-1][j]; // this one is vectorized
for (int j = 1; j < TS; j++)
    tab1[i][j] += tab1[i][j-1];
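(If you want to see that report, GCC's missed-vectorization diagnostics can be printed with -fopt-info-vec-missed, for example:

    gcc main.c -O3 -std=c99 -Wall -DTS=32 -fopt-info-vec-missed -o g32

which should show the data-dependence message for the original inner loop.)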
You might get better performance in GCC by rewriting simul1 into:
void simul1(void)
{
    for (int j = 1; j < TS; j++)
        tab1[0][j] += tab1[0][j-1];
    for (int i = 1; i < TS; i++) {
        for (int j = 0; j < TS; j++)
            tab1[i][j] += tab1[i-1][j];
        for (int j = 1; j < TS; j++)
            tab1[i][j] += tab1[i][j-1];
    }
}
My results with GCC 6.3.0, -O3 -march=native, TS = 32, on an Intel Core i5-5200U are:
Original version:
real 0m21.110s
user 0m21.004s
sys 0m0.000s
Modified:
real 0m8.588s
user 0m8.536s
sys 0m0.000s
After some research, I found that it is possible to vectorize the second inner loop using vector additions and shifts. The algorithm is presented here.
#include "emmintrin.h"
void simul1(void)
{
for (int j = 1; j < TS; j++)
tab1[0][j] += tab1[0][j-1];
for (int i = 1; i < TS; i++) {
for (int j = 0; j < TS; j++)
tab1[i][j] += tab1[i-1][j];
for (int stride = 0; stride < TS; stride += 4) {
__m128i v;
v = _mm_loadu_si128((__m128i*) (tab1[i] + stride));
v = _mm_add_epi32(v, _mm_slli_si128(v, sizeof(int)));
v = _mm_add_epi32(v, _mm_slli_si128(v, 2*sizeof(int)));
_mm_storeu_si128((__m128i*) (tab1[i] + stride), v);
}
for (int stride = 4; stride < TS; stride += 4)
for (int j = 0; j < 4; j++)
tab1[i][stride+j] += tab1[i][stride-1];
}
}
Result:
real 0m7.541s
user 0m7.496s
sys 0m0.004s
This one is more complicated. Consider an eight-element vector of ints:
V = (a, b, c, d, e, f, g, h)
We can treat it as two packed vectors:
(a, b, c, d), (e, f, g, h)
First, the algorithm performs two independent prefix summations:
(a, b, c, d), (e, f, g, h)
+
(0, a, b, c), (0, e, f, g)
=
(a, a+b, b+c, c+d), (e, e+f, f+g, g+h)
+
(0, 0, a, a+b), (0, 0, e, e+f)
=
(a, a+b, a+b+c, a+b+c+d), (e, e+f, e+f+g, e+f+g+h)
Then it propagates the last element of the first vector into each element of the second vector, so it finally yields:
(a, a+b, a+b+c, a+b+c+d), (a+b+c+d+e, a+b+c+d+e+f, a+b+c+d+e+f+g, a+b+c+d+e+f+g+h)
I suspect that these intrinsics could be written better, so there is potential for some improvement. (Note that the 256-bit integer intrinsics used here require AVX2.)
#include "immintrin.h"
void simul1(void)
{
for (int j = 1; j < TS; j++)
tab1[0][j] += tab1[0][j-1];
for (int i = 1; i < TS; i++) {
for (int j = 0; j < TS; j++)
tab1[i][j] += tab1[i-1][j];
for (int stride = 0; stride < TS; stride += 8) {
__m256i v;
v = _mm256_loadu_si256((__m256i*) (tab1[i] + stride));
v = _mm256_add_epi32(v, _mm256_slli_si256(v, sizeof(int)));
v = _mm256_add_epi32(v, _mm256_slli_si256(v, 2*sizeof(int)));
__m256i t = _mm256_setzero_si256();
t = _mm256_insertf128_si256(t,
_mm_shuffle_epi32(_mm256_castsi256_si128(v), 0xFF), 1);
v = _mm256_add_epi32(v, t);
_mm256_storeu_si256((__m256i*) (tab1[i] + stride), v);
}
for (int stride = 8; stride < TS; stride += 8)
for (int j = 0; j < 8; j++)
tab1[i][stride+j] += tab1[i][stride-1];
}
}
Result (Clang 3.8):
real 0m5.644s
user 0m5.364s
sys 0m0.004s
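As an example of the "could be written better" remark above, the setzero/insertf128/shuffle sequence that broadcasts a+b+c+d into the upper half could possibly be replaced with a shorter, untested AVX2 variant:

    /* untested sketch: move the low 128-bit lane of v into the high lane (low lane zeroed),
       then broadcast element 3 within each 128-bit lane, giving (0,0,0,0, s,s,s,s)
       where s = a+b+c+d */
    __m256i t = _mm256_shuffle_epi32(_mm256_permute2x128_si256(v, v, 0x08), 0xFF);
    v = _mm256_add_epi32(v, t);

Whether this is actually faster would need to be measured.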
Looks like a classic case of cache contention. Your code is written so that it operates on adjacent matrix rows and columns. This can be painful when the matrix rows align with cache lines, so that data from adjacent rows ends up stored in the same cache lines.
But there isn't a lot of data. If a line gets kicked out of the fast L1 cache, it will probably still fit in the fairly fast L2 cache. L2 is apparently fast enough for the code that GCC emits, but it can't keep up with the vectorized code from ICC.
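A simple (hypothetical) way to test this is to pad each row by one element, which changes the distance between consecutive rows in memory while leaving the loops untouched:

    /* hypothetical padding experiment: the loops still only touch columns 0..TS-1,
       but consecutive rows are now TS+1 ints apart in memory */
    #define PAD 1
    unsigned int tab1[TS][TS + PAD];

If the row stride and its interaction with the cache really is the cause, the gap between even and odd sizes should change noticeably with the padded layout.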