
"Assume" clause in gcc

Does gcc (latest versions: 4.8, 4.9) have an "assume" clause similar to the __assume() built-in supported by icc? E.g., __assume( n % 8 == 0 );

user2052436 asked Sep 04 '14 14:09


2 Answers

As of gcc 4.8.2, there is no equivalent of __assume() in gcc. I don't know why -- it would be very useful. mafso suggested:

#define __assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)

This is an old trick, known at least as far back as 2010 and probably earlier. The compiler usually optimizes out the evaluation of 'cond', because any execution in which cond is false would have undefined behavior anyway. However, it does not seem to optimize away 'cond' if it contains a call to an opaque (non-inlined) function. The compiler must assume that the opaque call might have a side effect (e.g., change a global), so it cannot remove the call itself, although it could remove any computations and branches on the result. For this reason, the macro approach is a partial solution at best.
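As a minimal sketch of how the macro might be used, the function below promises the compiler that n is a non-negative multiple of 8. The names ASSUME and sum8 are hypothetical, chosen for illustration only (ASSUME avoids the reserved __assume spelling). Whether a given gcc version actually exploits the hint is not guaranteed.

```c
/* Hypothetical macro name; same trick as above, built on
   GCC's __builtin_unreachable(). */
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)

/* Sums the first n elements of x.
   The caller promises that n is a non-negative multiple of 8;
   calling it with any other n is undefined behavior. */
float sum8(const float *x, int n)
{
    ASSUME(n >= 0);
    ASSUME(n % 8 == 0);  /* no side effects, so the check itself can be dropped */

    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}
```

The conditions here are plain arithmetic on a local value, which is the case the macro handles well. A condition such as ASSUME(check(n)), where check is an opaque function, would still emit the call.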

Pablo Halpern answered Oct 07 '22 01:10


In your example you want to inform the compiler that N is a multiple of 8. You can do this simply by inserting the line

N = N & 0xFFFFFFF8;

in your code (if N is a 32-bit integer). The mask clears the three lowest bits, so it leaves N unchanged when N is already a multiple of 8. Since GCC 4.9, the compiler seems to understand, after this line, that N is a multiple of 8.

This is shown by the next example, in which two float vectors are added:

int add_a(float * restrict a, float * restrict b, int N)
{
    a = (float*)__builtin_assume_aligned(a, 32);
    b = (float*)__builtin_assume_aligned(b, 32);
    N = N & 0xFFFFFFF8;  /* clear the low three bits of N */
    for (int i = 0; i < N; i++){
        a[i] = a[i] + b[i];
    }
    return 0;
}


int add_b(float * restrict a, float * restrict b, int N)
{
    a = (float*)__builtin_assume_aligned(a, 32);
    b = (float*)__builtin_assume_aligned(b, 32);
    for (int i = 0; i < N; i++){
        a[i] = a[i] + b[i];
    }
    return 0;
}

Compiled with gcc version 4.9 using gcc -m64 -std=c99 -O3, add_a produces the following vectorized code:

add_a:
  and edx, -8
  jle .L6
  sub edx, 4
  xor ecx, ecx
  shr edx, 2
  lea eax, [rdx+1]
  xor edx, edx
.L3:
  movaps xmm0, XMMWORD PTR [rdi+rdx]
  add ecx, 1
  addps xmm0, XMMWORD PTR [rsi+rdx]
  movaps XMMWORD PTR [rdi+rdx], xmm0
  add rdx, 16
  cmp ecx, eax
  jb .L3
.L6:
  xor eax, eax
  ret

For function add_b, more than 20 extra instructions are needed to handle the case where N is not a multiple of 8:

add_b:
  test edx, edx
  jle .L17
  lea ecx, [rdx-4]
  lea r8d, [rdx-1]
  shr ecx, 2
  add ecx, 1
  cmp r8d, 2
  lea eax, [0+rcx*4]
  jbe .L16
  xor r8d, r8d
  xor r9d, r9d
.L11:
  movaps xmm0, XMMWORD PTR [rdi+r8]
  add r9d, 1
  addps xmm0, XMMWORD PTR [rsi+r8]
  movaps XMMWORD PTR [rdi+r8], xmm0
  add r8, 16
  cmp ecx, r9d
  ja .L11
  cmp eax, edx
  je .L17
.L10:
  movsx r8, eax
  lea rcx, [rdi+r8*4]
  movss xmm0, DWORD PTR [rcx]
  addss xmm0, DWORD PTR [rsi+r8*4]
  movss DWORD PTR [rcx], xmm0
  lea ecx, [rax+1]
  cmp edx, ecx
  jle .L17
  movsx rcx, ecx
  add eax, 2
  lea r8, [rdi+rcx*4]
  cmp edx, eax
  movss xmm0, DWORD PTR [r8]
  addss xmm0, DWORD PTR [rsi+rcx*4]
  movss DWORD PTR [r8], xmm0
  jle .L17
  cdqe
  lea rdx, [rdi+rax*4]
  movss xmm0, DWORD PTR [rdx]
  addss xmm0, DWORD PTR [rsi+rax*4]
  movss DWORD PTR [rdx], xmm0
.L17:
  xor eax, eax
  ret
.L16:
  xor eax, eax
  jmp .L10

See Godbolt link.
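The masking step in add_a can also be written without hard-coding a 32-bit constant. The sketch below uses n & ~7, which matches the "and edx, -8" instruction in the listing above on two's complement targets. The names round_down_to_8 and add_masked are hypothetical and not part of the original answer. Note that if n is not actually a multiple of 8, the mask silently rounds it down, so the trailing elements are skipped.

```c
/* Rounds n down to a multiple of 8; multiples of 8 are returned unchanged.
   ~7 has every bit set except the low three. */
static inline int round_down_to_8(int n)
{
    return n & ~7;
}

/* Variant of add_a that uses the helper above.
   Only the first round_down_to_8(n) elements are updated. */
void add_masked(float *restrict a, const float *restrict b, int n)
{
    n = round_down_to_8(n);
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}
```

For example, calling add_masked with n = 13 updates only the first 8 elements, which is why this trick is only safe when the caller already guarantees the multiple-of-8 property.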

wim answered Oct 07 '22 00:10