Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does the restrict keyword provide significant benefits in gcc/g++?

Has anyone ever seen any numbers/analysis on whether or not use of the C/C++ restrict keyword in gcc/g++ actual provides any significant performance boost in reality (and not just in theory)?

I've read various articles recommending / disparaging its use, but I haven't ran across any real numbers practically demonstrating either sides arguments.

EDIT

I know that restrict is not officially part of C++, but it is supported by some compilers and I've read a paper by Christer Ericson which strongly recommends its usage.

like image 697
Robert S. Barnes Avatar asked Dec 27 '09 08:12

Robert S. Barnes


People also ask

What is restrict keyword used for?

restrict keyword is mainly used in pointer declarations as a type qualifier for pointers. It doesn't add any new functionality. It is only a way for programmer to inform about an optimization that compiler can make.

What is G ++ GCC?

gcc. g++ is used to compile C++ program. gcc is used to compile C program.

Should I use GCC or G ++ for C++?

For c++ you should use g++. It's the same compiler (e.g. the GNU compiler collection). GCC or G++ just choose a different front-end with different default options.


3 Answers

The restrict keyword does a difference.

I've seen improvements of factor 2 and more in some situations (image processing). Most of the time the difference is not that large though. About 10%.

Here is a little example that illustrate the difference. I've written a very basic 4x4 vector * matrix transform as a test. Note that I have to force the function not to be inlined. Otherwise GCC detects that there aren't any aliasing pointers in my benchmark code and restrict wouldn't make a difference due to inlining.

I could have moved the transform function to a different file as well.

#include <math.h>  #ifdef USE_RESTRICT #else #define __restrict #endif   void transform (float * __restrict dest, float * __restrict src,                  float * __restrict matrix, int n) __attribute__ ((noinline));  void transform (float * __restrict dest, float * __restrict src,                  float * __restrict matrix, int n) {   int i;    // simple transform loop.    // written with aliasing in mind. dest, src and matrix    // are potentially aliasing, so the compiler is forced to reload   // the values of matrix and src for each iteration.    for (i=0; i<n; i++)   {     dest[0] = src[0] * matrix[0] + src[1] * matrix[1] +                src[2] * matrix[2] + src[3] * matrix[3];      dest[1] = src[0] * matrix[4] + src[1] * matrix[5] +                src[2] * matrix[6] + src[3] * matrix[7];      dest[2] = src[0] * matrix[8] + src[1] * matrix[9] +                src[2] * matrix[10] + src[3] * matrix[11];      dest[3] = src[0] * matrix[12] + src[1] * matrix[13] +                src[2] * matrix[14] + src[3] * matrix[15];      src  += 4;     dest += 4;   } }  float srcdata[4*10000]; float dstdata[4*10000];  int main (int argc, char**args) {   int i,j;   float matrix[16];    // init all source-data, so we don't get NANs     for (i=0; i<16; i++)   matrix[i] = 1;   for (i=0; i<4*10000; i++) srcdata[i] = i;    // do a bunch of tests for benchmarking.    for (j=0; j<10000; j++)     transform (dstdata, srcdata, matrix, 10000); } 

Results: (on my 2 Ghz Core Duo)

nils@doofnase:~$ gcc -O3 test.c nils@doofnase:~$ time ./a.out  real    0m2.517s user    0m2.516s sys     0m0.004s  nils@doofnase:~$ gcc -O3 -DUSE_RESTRICT test.c nils@doofnase:~$ time ./a.out  real    0m2.034s user    0m2.028s sys     0m0.000s 

Over the thumb 20% faster execution, on that system.

To show how much it depends on the architecture I've let the same code run on a Cortex-A8 embedded CPU (adjusted the loop count a bit cause I don't want to wait that long):

root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp test.c root@beagleboard:~# time ./a.out  real    0m 7.64s user    0m 7.62s sys     0m 0.00s  root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -DUSE_RESTRICT test.c  root@beagleboard:~# time ./a.out  real    0m 7.00s user    0m 6.98s sys     0m 0.00s 

Here the difference is just 9% (same compiler btw.)

like image 104
Nils Pipenbrinck Avatar answered Oct 05 '22 07:10

Nils Pipenbrinck


Does the restrict keyword provide significant benefits in gcc / g++ ?

It can reduce the number of instructions as shown on the example below, so use it whenever possible.

GCC 4.8 Linux x86-64 exmample

Input:

void f(int *a, int *b, int *x) {   *a += *x;   *b += *x; }  void fr(int *restrict a, int *restrict b, int *restrict x) {   *a += *x;   *b += *x; } 

Compile and decompile:

gcc -g -std=c99 -O0 -c main.c objdump -S main.o 

With -O0, they are the same.

With -O3:

void f(int *a, int *b, int *x) {     *a += *x;    0:   8b 02                   mov    (%rdx),%eax    2:   01 07                   add    %eax,(%rdi)     *b += *x;    4:   8b 02                   mov    (%rdx),%eax    6:   01 06                   add    %eax,(%rsi)    void fr(int *restrict a, int *restrict b, int *restrict x) {     *a += *x;   10:   8b 02                   mov    (%rdx),%eax   12:   01 07                   add    %eax,(%rdi)     *b += *x;   14:   01 06                   add    %eax,(%rsi)  

For the uninitiated, the calling convention is:

  • rdi = first parameter
  • rsi = second parameter
  • rdx = third parameter

Conclusion: 3 instructions instead of 4.

Of course, instructions can have different latencies, but this gives a good idea.

Why GCC was able to optimize that?

The code above was taken from the Wikipedia example which is very illuminating.

Pseudo assembly for f:

load R1 ← *x    ; Load the value of x pointer load R2 ← *a    ; Load the value of a pointer add R2 += R1    ; Perform Addition set R2 → *a     ; Update the value of a pointer ; Similarly for b, note that x is loaded twice, ; because x may point to a (a aliased by x) thus  ; the value of x will change when the value of a ; changes. load R1 ← *x load R2 ← *b add R2 += R1 set R2 → *b 

For fr:

load R1 ← *x load R2 ← *a add R2 += R1 set R2 → *a ; Note that x is not reloaded, ; because the compiler knows it is unchanged ; "load R1 ← *x" is no longer needed. load R2 ← *b add R2 += R1 set R2 → *b 

Is it really any faster?

Ermmm... not for this simple test:

.text     .global _start     _start:         mov $0x10000000, %rbx         mov $x, %rdx         mov $x, %rdi         mov $x, %rsi     loop:         # START of interesting block         mov (%rdx),%eax         add %eax,(%rdi)         mov (%rdx),%eax # Comment out this line.         add %eax,(%rsi)         # END ------------------------         dec %rbx         cmp $0, %rbx         jnz loop         mov $60, %rax         mov $0, %rdi         syscall .data     x:         .int 0 

And then:

as -o a.o a.S && ld a.o && time ./a.out 

on Ubuntu 14.04 AMD64 CPU Intel i5-3210M.

I confess that I still don't understand modern CPUs. Let me know if you:

  • found a flaw in my method
  • found an assembler test case where it becomes much faster
  • understand why there wasn't a difference

The article Demystifying The Restrict Keyword refers to the paper Why Programmer-specified Aliasing is a Bad Idea (pdf) which says it generally doesn't help and provides measurements to back this up.

like image 43
George Avatar answered Oct 05 '22 05:10

George