Is there a way to instruct GCC (I'm using 4.8.4) to completely unroll the while loop in the function below, i.e., to peel this loop? The number of iterations of the loop is known at compile time: 58.
Let me first explain what I have tried.
By checking the GAS output:
gcc -fpic -O2 -S GEPDOT.c
12 registers, XMM0 - XMM11, are used. If I pass the flag -funroll-loops to gcc:
gcc -fpic -O2 -funroll-loops -S GEPDOT.c
the loop is only unrolled twice. I checked the GCC optimization options. GCC says that -funroll-loops also turns on -frename-registers, so when GCC unrolls a loop, its first choice for register allocation is to use "left over" registers. But there are only 4 left over, XMM12 - XMM15, so GCC can unroll at most twice. Had there been 48 instead of 16 XMM registers available, GCC would have unrolled the while loop 4 times without trouble.
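One way to check whether register renaming is really the limiting factor would presumably be to disable it explicitly (I have not verified that this changes the outcome):
gcc -fpic -O2 -funroll-loops -fno-rename-registers -S GEPDOT.c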
Yet I did another experiment. I first unrolled the while loop twice manually, obtaining a function GEPDOT_2. Then there is no difference at all between
gcc -fpic -O2 -S GEPDOT_2.c
and
gcc -fpic -O2 -funroll-loops -S GEPDOT_2.c
Since GEPDOT_2 has already used up all the registers, no further unrolling is performed.
GCC does register renaming to avoid potential false dependencies introduced by unrolling. But I know for sure that there is no such potential in my GEPDOT; even if there were, it would not matter. I tried unrolling the loop myself: unrolling 4 times is faster than unrolling 2 times, which in turn is faster than no unrolling. Of course I can manually unroll more times, but that is tedious. Can GCC do this for me? Thanks.
// C file "GEPDOT.c"
#include <emmintrin.h>

void GEPDOT (double *A, double *B, double *C) {
    __m128d A1_vec = _mm_load_pd(A); A += 2;
    __m128d B_vec = _mm_load1_pd(B); B++;
    __m128d C1_vec = A1_vec * B_vec;
    __m128d A2_vec = _mm_load_pd(A); A += 2;
    __m128d C2_vec = A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    __m128d C3_vec = A1_vec * B_vec;
    __m128d C4_vec = A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    __m128d C5_vec = A1_vec * B_vec;
    __m128d C6_vec = A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    __m128d C7_vec = A1_vec * B_vec;
    A1_vec = _mm_load_pd(A); A += 2;
    __m128d C8_vec = A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    int k = 58;
    /* can compiler unroll the loop completely (i.e., peel this loop)? */
    while (k--) {
        C1_vec += A1_vec * B_vec;
        A2_vec = _mm_load_pd(A); A += 2;
        C2_vec += A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
        C3_vec += A1_vec * B_vec;
        C4_vec += A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
        C5_vec += A1_vec * B_vec;
        C6_vec += A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
        C7_vec += A1_vec * B_vec;
        A1_vec = _mm_load_pd(A); A += 2;
        C8_vec += A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
    }
    C1_vec += A1_vec * B_vec;
    A2_vec = _mm_load_pd(A);
    C2_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C3_vec += A1_vec * B_vec;
    C4_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C5_vec += A1_vec * B_vec;
    C6_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B);
    C7_vec += A1_vec * B_vec;
    C8_vec += A2_vec * B_vec;
    /* [write-back] */
    A1_vec = _mm_load_pd(C);      C1_vec = A1_vec - C1_vec;
    A2_vec = _mm_load_pd(C + 2);  C2_vec = A2_vec - C2_vec;
    A1_vec = _mm_load_pd(C + 4);  C3_vec = A1_vec - C3_vec;
    A2_vec = _mm_load_pd(C + 6);  C4_vec = A2_vec - C4_vec;
    A1_vec = _mm_load_pd(C + 8);  C5_vec = A1_vec - C5_vec;
    A2_vec = _mm_load_pd(C + 10); C6_vec = A2_vec - C6_vec;
    A1_vec = _mm_load_pd(C + 12); C7_vec = A1_vec - C7_vec;
    A2_vec = _mm_load_pd(C + 14); C8_vec = A2_vec - C8_vec;
    _mm_store_pd(C, C1_vec);      _mm_store_pd(C + 2, C2_vec);
    _mm_store_pd(C + 4, C3_vec);  _mm_store_pd(C + 6, C4_vec);
    _mm_store_pd(C + 8, C5_vec);  _mm_store_pd(C + 10, C6_vec);
    _mm_store_pd(C + 12, C7_vec); _mm_store_pd(C + 14, C8_vec);
}
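As an aside on the tedium of manual unrolling: it could presumably be reduced by factoring the loop body into a macro and repeating that, instead of copy-pasting individual statements. A sketch only (the macro name is mine, and I have not benchmarked this form):
/* One iteration of the accumulation loop, expanded in place inside GEPDOT
   (all names refer to the locals declared above). */
#define GEPDOT_STEP()                        \
    do {                                     \
        C1_vec += A1_vec * B_vec;            \
        A2_vec = _mm_load_pd(A); A += 2;     \
        C2_vec += A2_vec * B_vec;            \
        B_vec = _mm_load1_pd(B); B++;        \
        C3_vec += A1_vec * B_vec;            \
        C4_vec += A2_vec * B_vec;            \
        B_vec = _mm_load1_pd(B); B++;        \
        C5_vec += A1_vec * B_vec;            \
        C6_vec += A2_vec * B_vec;            \
        B_vec = _mm_load1_pd(B); B++;        \
        C7_vec += A1_vec * B_vec;            \
        A1_vec = _mm_load_pd(A); A += 2;     \
        C8_vec += A2_vec * B_vec;            \
        B_vec = _mm_load1_pd(B); B++;        \
    } while (0)

    /* inside GEPDOT, replacing the while loop; 58 is even, so unrolling by 2 stays exact */
    int k;
    for (k = 0; k < 58; k += 2) {
        GEPDOT_STEP();
        GEPDOT_STEP();
    }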
Thanks to the comment by @user3386109, I would like to extend this question a little bit. @user3386109 raises a very good point. Actually I do have some doubts about the compiler's ability to do optimal register allocation when there are so many parallel instructions to schedule.
I personally think that a reliable way is to first code the loop body (which is the key to HPC) in inline assembly, then duplicate it as many times as I want. I had an unpopular post earlier this year: inline assembly. The code there was a little different, because the number of loop iterations, j, is a function argument and hence unknown at compile time. In that case I cannot fully unroll the loop, so I only duplicated the assembly code twice and converted the loop into a label and a jump. It turned out that the resulting performance of my hand-written assembly was about 5% higher than the compiler-generated assembly, which might suggest that the compiler fails to allocate registers in the expected, optimal manner.
I was (and still am) a baby in assembly coding, so this serves as a good case study for me to learn a little bit of x86 assembly. But in the long run I am not inclined to code a large proportion of GEPDOT in assembly. There are mainly three reasons, an important one being the blocking factor nb: for a general nb, the number of iterations would be nb - 2. I am not going to pass nb as a function argument, as I did in the earlier post. It is a machine-specific parameter and will be defined as a macro, so the number of iterations is known at compile time, but may vary from machine to machine. Guess how much tedious work I would have to do in manual loop unrolling for a variety of nb values. So if there is a way to simply instruct the compiler to peel a loop, that would be great. I would also very much appreciate it if you could share some experience in producing a high-performance, yet portable library.
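To make the nb point concrete, here is a hypothetical sketch (the macro name and the value 60 are mine; 60 - 2 reproduces the 58-iteration trip count above):
/* hypothetical: machine-specific blocking factor, defined per target at build time */
#define GEPDOT_NB 60

    /* inside GEPDOT, instead of "int k = 58;" */
    int k = GEPDOT_NB - 2;   /* still a compile-time constant, but varies from machine to machine */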
This is not an answer, but might be of interest to others trying to vectorize matrix multiplications with GCC.
Below, I assume c is a 4×4 matrix in row-major order, a is a 4-row, n-column matrix in column-major order (transposed), b is a 4-column, n-row matrix in row-major order, and the operation to calculate is c = a × b + c, where × denotes matrix multiplication.
The naive function to accomplish this is
#include <stddef.h>  /* for size_t */

void slow_4(double       *c,
            const double *a,
            const double *b,
            size_t        n)
{
    size_t row, col, i;

    for (row = 0; row < 4; row++)
        for (col = 0; col < 4; col++)
            for (i = 0; i < n; i++)
                c[4*row+col] += a[4*i+row] * b[4*i+col];
}
GCC generates pretty good code for SSE2/SSE3 using
#if defined(__SSE2__) || defined(__SSE3__)

typedef double vec2d __attribute__((vector_size (2 * sizeof (double))));

void fast_4(vec2d       *c,
            const vec2d *a,
            const vec2d *b,
            size_t       n)
{
    const vec2d *const b_end = b + 2L * n;

    vec2d s00 = c[0];
    vec2d s02 = c[1];
    vec2d s10 = c[2];
    vec2d s12 = c[3];
    vec2d s20 = c[4];
    vec2d s22 = c[5];
    vec2d s30 = c[6];
    vec2d s32 = c[7];

    while (b < b_end) {
        const vec2d b0 = b[0];
        const vec2d b2 = b[1];
        const vec2d a0 = { a[0][0], a[0][0] };
        const vec2d a1 = { a[0][1], a[0][1] };
        const vec2d a2 = { a[1][0], a[1][0] };
        const vec2d a3 = { a[1][1], a[1][1] };
        s00 += a0 * b0;
        s10 += a1 * b0;
        s20 += a2 * b0;
        s30 += a3 * b0;
        s02 += a0 * b2;
        s12 += a1 * b2;
        s22 += a2 * b2;
        s32 += a3 * b2;
        b += 2;
        a += 2;
    }

    c[0] = s00;
    c[1] = s02;
    c[2] = s10;
    c[3] = s12;
    c[4] = s20;
    c[5] = s22;
    c[6] = s30;
    c[7] = s32;
}

#endif
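For reference, here is a minimal calling sketch of my own (not part of the code above). The movapd loads require 16-byte alignment, and casting plain double arrays to vec2d * is a shortcut whose strict-aliasing purity I have not audited:
#define N 256   /* hypothetical column count */

static double a[4 * N] __attribute__ ((aligned (16)));  /* 4 rows, N cols, column-major */
static double b[4 * N] __attribute__ ((aligned (16)));  /* N rows, 4 cols, row-major */
static double c[4 * 4] __attribute__ ((aligned (16)));  /* 4x4 result, row-major */

int main(void)
{
    /* ... fill a, b and c here ... */
    fast_4((vec2d *)c, (const vec2d *)a, (const vec2d *)b, N);
    return 0;
}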
For AVX, GCC can do even better with
#if defined(__AVX__) || defined(__AVX2__)

typedef double vec4d __attribute__((vector_size (4 * sizeof (double))));

void fast_4(vec4d       *c,
            const vec4d *a,
            const vec4d *b,
            size_t       n)
{
    const vec4d *const b_end = b + n;

    vec4d s0 = c[0];
    vec4d s1 = c[1];
    vec4d s2 = c[2];
    vec4d s3 = c[3];

    while (b < b_end) {
        const vec4d bc = *(b++);
        const vec4d ac = *(a++);
        const vec4d a0 = { ac[0], ac[0], ac[0], ac[0] };
        const vec4d a1 = { ac[1], ac[1], ac[1], ac[1] };
        const vec4d a2 = { ac[2], ac[2], ac[2], ac[2] };
        const vec4d a3 = { ac[3], ac[3], ac[3], ac[3] };
        s0 += a0 * bc;
        s1 += a1 * bc;
        s2 += a2 * bc;
        s3 += a3 * bc;
    }

    c[0] = s0;
    c[1] = s1;
    c[2] = s2;
    c[3] = s3;
}

#endif
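The same calling pattern works for the vec4d version, except that the vmovapd loads then require 32-byte alignment; with static buffers that would be, e.g. (again a sketch of mine, reusing the hypothetical N above):
/* sketch: 32-byte alignment for the AVX variant */
static double a[4 * N] __attribute__ ((aligned (32)));
static double b[4 * N] __attribute__ ((aligned (32)));
static double c[4 * 4] __attribute__ ((aligned (32)));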
The SSE3 version of the generated assembly using gcc-4.8.4 (-O2 -march=x86-64 -mtune=generic -msse3) is essentially
fast_4:
        salq    $5, %rcx
        movapd  (%rdi), %xmm13
        addq    %rdx, %rcx
        cmpq    %rcx, %rdx
        movapd  16(%rdi), %xmm12
        movapd  32(%rdi), %xmm11
        movapd  48(%rdi), %xmm10
        movapd  64(%rdi), %xmm9
        movapd  80(%rdi), %xmm8
        movapd  96(%rdi), %xmm7
        movapd  112(%rdi), %xmm6
        jnb     .L2
.L3:
        movddup (%rsi), %xmm5
        addq    $32, %rdx
        movapd  -32(%rdx), %xmm1
        addq    $32, %rsi
        movddup -24(%rsi), %xmm4
        movapd  %xmm5, %xmm14
        movddup -16(%rsi), %xmm3
        movddup -8(%rsi), %xmm2
        mulpd   %xmm1, %xmm14
        movapd  -16(%rdx), %xmm0
        cmpq    %rdx, %rcx
        mulpd   %xmm0, %xmm5
        addpd   %xmm14, %xmm13
        movapd  %xmm4, %xmm14
        mulpd   %xmm0, %xmm4
        addpd   %xmm5, %xmm12
        mulpd   %xmm1, %xmm14
        addpd   %xmm4, %xmm10
        addpd   %xmm14, %xmm11
        movapd  %xmm3, %xmm14
        mulpd   %xmm0, %xmm3
        mulpd   %xmm1, %xmm14
        mulpd   %xmm2, %xmm0
        addpd   %xmm3, %xmm8
        mulpd   %xmm2, %xmm1
        addpd   %xmm14, %xmm9
        addpd   %xmm0, %xmm6
        addpd   %xmm1, %xmm7
        ja      .L3
.L2:
        movapd  %xmm13, (%rdi)
        movapd  %xmm12, 16(%rdi)
        movapd  %xmm11, 32(%rdi)
        movapd  %xmm10, 48(%rdi)
        movapd  %xmm9, 64(%rdi)
        movapd  %xmm8, 80(%rdi)
        movapd  %xmm7, 96(%rdi)
        movapd  %xmm6, 112(%rdi)
        ret
The AVX version of the generated assembly (-O2 -march=x86-64 -mtune=generic -mavx) is essentially
fast_4:
        salq    $5, %rcx
        vmovapd (%rdi), %ymm5
        addq    %rdx, %rcx
        vmovapd 32(%rdi), %ymm4
        cmpq    %rcx, %rdx
        vmovapd 64(%rdi), %ymm3
        vmovapd 96(%rdi), %ymm2
        jnb     .L2
.L3:
        addq    $32, %rsi
        vmovapd -32(%rsi), %ymm1
        addq    $32, %rdx
        vmovapd -32(%rdx), %ymm0
        cmpq    %rdx, %rcx
        vpermilpd  $0, %ymm1, %ymm6
        vperm2f128 $0, %ymm6, %ymm6, %ymm6
        vmulpd  %ymm0, %ymm6, %ymm6
        vaddpd  %ymm6, %ymm5, %ymm5
        vpermilpd  $15, %ymm1, %ymm6
        vperm2f128 $0, %ymm6, %ymm6, %ymm6
        vmulpd  %ymm0, %ymm6, %ymm6
        vaddpd  %ymm6, %ymm4, %ymm4
        vpermilpd  $0, %ymm1, %ymm6
        vpermilpd  $15, %ymm1, %ymm1
        vperm2f128 $17, %ymm6, %ymm6, %ymm6
        vperm2f128 $17, %ymm1, %ymm1, %ymm1
        vmulpd  %ymm0, %ymm6, %ymm6
        vmulpd  %ymm0, %ymm1, %ymm0
        vaddpd  %ymm6, %ymm3, %ymm3
        vaddpd  %ymm0, %ymm2, %ymm2
        ja      .L3
.L2:
        vmovapd %ymm5, (%rdi)
        vmovapd %ymm4, 32(%rdi)
        vmovapd %ymm3, 64(%rdi)
        vmovapd %ymm2, 96(%rdi)
        vzeroupper
        ret
The register scheduling is not optimal, I guess, but it does not look atrocious either. I'm personally happy with the above, without trying to hand-optimize it at this point.
On a Core i5-4200U processor (AVX2-capable), the fast versions of the above functions compute the product of two 4×256 matrices in 1843 CPU cycles (median) for SSE3, and 1248 cycles for AVX2. That comes down to 1.8 and 1.22 cycles per matrix entry. The unvectorized slow version takes about 11 cycles per matrix entry, for comparison.
(The cycle counts are median values, i.e. half the tests were faster. I only ran some rough benchmarking with ~ 100k repeats or so, so do take these numbers with a grain of salt.)
On this CPU, cache effects are such that AVX2 at 4×512 matrix size is still at 1.2 cycles per entry, but at 4×1024, it drops to 1.4, at 4×4096 to 1.5, at 4×8192 to 1.8, and at 4×65536 to 2.2 cycles per entry. The SSE3 version stays at 1.8 cycles per entry up to 4×3072, at which point it starts slowing down; at 4×65536 it too is about 2.2 cycles per entry. I do believe this (laptop!) CPU is cache bandwidth limited at this point.
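The measurement harness is not shown above; the sketch below is roughly the kind of timing loop meant (a simplification consistent with the median-of-many-repeats description, not the exact code used; TSC serialization and turbo effects are ignored). Here fn would be a tiny wrapper that calls fast_4 on pre-filled, properly aligned buffers.
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Comparator for qsort on uint64_t values. */
static int cmp_u64(const void *x, const void *y)
{
    const uint64_t a = *(const uint64_t *)x;
    const uint64_t b = *(const uint64_t *)y;
    return (a > b) - (a < b);
}

/* Run fn() 'repeats' times and return the median TSC cycle count. */
uint64_t median_cycles(void (*fn)(void), size_t repeats)
{
    uint64_t *t = malloc(repeats * sizeof *t);
    uint64_t med, start;
    size_t i;

    if (!t)
        return 0;
    for (i = 0; i < repeats; i++) {
        start = __rdtsc();
        fn();
        t[i] = __rdtsc() - start;
    }
    qsort(t, repeats, sizeof *t, cmp_u64);
    med = t[repeats / 2];
    free(t);
    return med;
}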
Try to tweak the optimizer parameters:
gcc -O3 -funroll-loops --param max-completely-peeled-insns=1000 --param max-completely-peel-times=100
This should do the trick.
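For the file in the question, that would be along the lines of (untested; the thresholds are just generous upper bounds):
gcc -fpic -O3 -funroll-loops --param max-completely-peeled-insns=1000 --param max-completely-peel-times=100 -S GEPDOT.c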