Does gcc (latest versions: 4.8, 4.9) have an "assume" clause similar to the __assume() built-in supported by icc?
E.g., __assume( n % 8 == 0 );
As of gcc 4.8.2, there is no equivalent of __assume() in gcc. I don't know why -- it would be very useful. mafso suggested:
#define __assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
This is an old trick, known since at least 2010 and probably earlier. The compiler usually optimizes out the evaluation of 'cond', because any execution in which cond is false would have undefined behavior anyway. However, it does not seem to optimize away 'cond' if it contains a call to an opaque (non-inlined) function. The compiler must assume that the opaque call might have a side effect (e.g., change a global) and cannot remove the call, although it can remove any computations and branches on the result. For this reason, the macro approach is a partial solution at best.
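As a minimal sketch of how the macro might be used for the n % 8 case from the question: the names ASSUME and scale are hypothetical (ASSUME is used here instead of __assume only because identifiers starting with a double underscore are reserved), and whether the hint actually changes the generated code depends on the compiler version.

```c
/* Hypothetical macro name; same body as the macro suggested above. */
#if defined(__GNUC__)
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#define ASSUME(cond) ((void)0)   /* no hint on other compilers */
#endif

/* Hypothetical example: multiplies n floats in place by k.
   The caller promises that n is a multiple of 8. */
void scale(float *v, int n, float k)
{
    ASSUME(n % 8 == 0);          /* undefined behavior if the promise is broken */
    for (int i = 0; i < n; i++)
        v[i] *= k;
}
```

The condition here has no function calls, so the compiler is free to drop its evaluation entirely. Calling scale with an n that is not a multiple of 8 is undefined behavior, so the hint should only be used where the promise is guaranteed by the caller.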
In your example you want to inform the compiler that N is a multiple of 8. You can do this simply by inserting the line
N = N & 0xFFFFFFF8;
in your code (if N is a 32-bit integer). This doesn't change N, because N is a multiple of 8, but since GCC 4.9 the compiler seems to understand that N is a multiple of 8 after this line.
This is shown by the next example, in which two float vectors are added:
int add_a(float * restrict a, float * restrict b, int N)
{
a = (float*)__builtin_assume_aligned(a, 32);
b = (float*)__builtin_assume_aligned(b, 32);
N = N & 0xFFFFFFF8;
for (int i = 0; i < N; i++){
a[i] = a[i] + b[i];
}
return 0;
}
int add_b(float * restrict a, float * restrict b, int N)
{
a = (float*)__builtin_assume_aligned(a, 32);
b = (float*)__builtin_assume_aligned(b, 32);
for (int i = 0; i < N; i++){
a[i] = a[i] + b[i];
}
return 0;
}
With gcc -m64 -std=c99 -O3, gcc version 4.9, add_a compiles to the vectorized code
add_a:
and edx, -8
jle .L6
sub edx, 4
xor ecx, ecx
shr edx, 2
lea eax, [rdx+1]
xor edx, edx
.L3:
movaps xmm0, XMMWORD PTR [rdi+rdx]
add ecx, 1
addps xmm0, XMMWORD PTR [rsi+rdx]
movaps XMMWORD PTR [rdi+rdx], xmm0
add rdx, 16
cmp ecx, eax
jb .L3
.L6:
xor eax, eax
ret
With function add_b, more than 20 extra instructions are needed to handle the case that N is not a multiple of 8:
add_b:
test edx, edx
jle .L17
lea ecx, [rdx-4]
lea r8d, [rdx-1]
shr ecx, 2
add ecx, 1
cmp r8d, 2
lea eax, [0+rcx*4]
jbe .L16
xor r8d, r8d
xor r9d, r9d
.L11:
movaps xmm0, XMMWORD PTR [rdi+r8]
add r9d, 1
addps xmm0, XMMWORD PTR [rsi+r8]
movaps XMMWORD PTR [rdi+r8], xmm0
add r8, 16
cmp ecx, r9d
ja .L11
cmp eax, edx
je .L17
.L10:
movsx r8, eax
lea rcx, [rdi+r8*4]
movss xmm0, DWORD PTR [rcx]
addss xmm0, DWORD PTR [rsi+r8*4]
movss DWORD PTR [rcx], xmm0
lea ecx, [rax+1]
cmp edx, ecx
jle .L17
movsx rcx, ecx
add eax, 2
lea r8, [rdi+rcx*4]
cmp edx, eax
movss xmm0, DWORD PTR [r8]
addss xmm0, DWORD PTR [rsi+rcx*4]
movss DWORD PTR [r8], xmm0
jle .L17
cdqe
lea rdx, [rdi+rax*4]
movss xmm0, DWORD PTR [rdx]
addss xmm0, DWORD PTR [rsi+rax*4]
movss DWORD PTR [rdx], xmm0
.L17:
xor eax, eax
ret
.L16:
xor eax, eax
jmp .L10
See Godbolt link.
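The mask 0xFFFFFFF8 is written for a 32-bit N. A width-independent spelling of the same idea is N & ~7, sketched below. The names round_down_to_8 and add_c are hypothetical, and it is an assumption, not something measured here, that GCC draws the same conclusion from this spelling as from the explicit constant (for a 32-bit int both masks have the same bit pattern).

```c
/* Hypothetical helper: clears the low three bits of n,
   i.e. rounds n down to a multiple of 8. */
static inline int round_down_to_8(int n)
{
    return n & ~7;
}

/* Hypothetical variant of add_a using the width-independent mask. */
int add_c(float *restrict a, const float *restrict b, int n)
{
    n = round_down_to_8(n);   /* no-op when n is already a multiple of 8 */
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
    }
    return 0;
}
```

Unlike an assume hint, the mask is real code: if n is not a multiple of 8, the last n % 8 elements are silently skipped rather than the behavior being undefined.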