Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing)

Consider the following code (p is of type unsigned char* and bitmap->width is of some integer type, exactly which is unknown and depends on which version of some external library we're using):

for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x) {     *p++ = 0xAA;     *p++ = 0xBB;     *p++ = 0xCC; } 

Is it worth optimizing it [..]

Could there be a case where this could yield more efficient results by writing:

unsigned width(static_cast<unsigned>(bitmap->width)); for (unsigned x = 0;  x < width;  ++x) {     *p++ = 0xAA;     *p++ = 0xBB;     *p++ = 0xCC; } 

... or is this trivial for the compiler to optimize?

What would you consider to be "better" code?

Note from editor (Ike): for those wondering about the strikeout text, the original question, as phrased, was dangerously close to off-topic territory and was very close to being closed in spite of positive feedback. These have been stricken out. Yet please do not punish the answerers who addressed these stricken sections of the question.

like image 553
Yaron Cohen-Tal Avatar asked Nov 24 '15 16:11

Yaron Cohen-Tal


People also ask

How do I disable compiler optimization?

Use the command-line option -O0 (-[capital o][zero]) to disable optimization, and -S to get assembly file. Look here to see more gcc command-line options.

Which of the following C library is used for optimizing programs for speed and performance?

For nearly all situations, the library qsort function is speedy enough to make implementation of your own sort algorithm unnecessary. Focus your optimization efforts on speeding up the comparison function.

What is Ofast?

-Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math , -fallow-store-data-races and the Fortran-specific -fstack-arrays , unless -fmax-stack-var-size is specified, and -fno-protect-parens .

What is gcc O3?

gcc -o writes the build output to an output file. gcc -O sets the compiler's optimization level.


2 Answers

At first glance, I thought the compiler could generate equivalent assembly for both versions with optimization flags activated. When I checked it, I was surprised to see the result:

Source unoptimized.cpp

note: this code is not meant to be executed.

struct bitmap_t {     long long width; } bitmap;  int main(int argc, char** argv) {     for (unsigned x = 0 ; x < static_cast<unsigned>(bitmap.width) ; ++x)     {         argv[x][0] = '\0';     }     return 0; } 

Source optimized.cpp

note: this code is not meant to be executed.

struct bitmap_t {     long long width; } bitmap;  int main(int argc, char** argv) {     const unsigned width = static_cast<unsigned>(bitmap.width);     for (unsigned x = 0 ; x < width ; ++x)     {         argv[x][0] = '\0';     }     return 0; } 

Compilation

  • $ g++ -s -O3 unoptimized.cpp
  • $ g++ -s -O3 optimized.cpp

Assembly (unoptimized.s)

    .file   "unoptimized.cpp"     .text     .p2align 4,,15 .globl main     .type   main, @function main: .LFB0:     .cfi_startproc     .cfi_personality 0x3,__gxx_personality_v0     movl    bitmap(%rip), %eax     testl   %eax, %eax     je  .L2     xorl    %eax, %eax     .p2align 4,,10     .p2align 3 .L3:     mov %eax, %edx     addl    $1, %eax     movq    (%rsi,%rdx,8), %rdx     movb    $0, (%rdx)     cmpl    bitmap(%rip), %eax     jb  .L3 .L2:     xorl    %eax, %eax     ret     .cfi_endproc .LFE0:     .size   main, .-main .globl bitmap     .bss     .align 8     .type   bitmap, @object     .size   bitmap, 8 bitmap:     .zero   8     .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"     .section    .note.GNU-stack,"",@progbits 

Assembly (optimized.s)

    .file   "optimized.cpp"     .text     .p2align 4,,15 .globl main     .type   main, @function main: .LFB0:     .cfi_startproc     .cfi_personality 0x3,__gxx_personality_v0     movl    bitmap(%rip), %eax     testl   %eax, %eax     je  .L2     subl    $1, %eax     leaq    8(,%rax,8), %rcx     xorl    %eax, %eax     .p2align 4,,10     .p2align 3 .L3:     movq    (%rsi,%rax), %rdx     addq    $8, %rax     cmpq    %rcx, %rax     movb    $0, (%rdx)     jne .L3 .L2:     xorl    %eax, %eax     ret     .cfi_endproc .LFE0:     .size   main, .-main .globl bitmap     .bss     .align 8     .type   bitmap, @object     .size   bitmap, 8 bitmap:     .zero   8     .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"     .section    .note.GNU-stack,"",@progbits 

diff

$ diff -uN unoptimized.s optimized.s --- unoptimized.s   2015-11-24 16:11:55.837922223 +0000 +++ optimized.s 2015-11-24 16:12:02.628922941 +0000 @@ -1,4 +1,4 @@ -   .file   "unoptimized.cpp" +   .file   "optimized.cpp"     .text     .p2align 4,,15  .globl main @@ -10,16 +10,17 @@     movl    bitmap(%rip), %eax     testl   %eax, %eax     je  .L2 +   subl    $1, %eax +   leaq    8(,%rax,8), %rcx     xorl    %eax, %eax     .p2align 4,,10     .p2align 3  .L3: -   mov %eax, %edx -   addl    $1, %eax -   movq    (%rsi,%rdx,8), %rdx +   movq    (%rsi,%rax), %rdx +   addq    $8, %rax +   cmpq    %rcx, %rax     movb    $0, (%rdx) -   cmpl    bitmap(%rip), %eax -   jb  .L3 +   jne .L3  .L2:     xorl    %eax, %eax     ret 

The generated assembly for the optimized version does actually load (lea) the width constant unlike the unoptimized version which computes the width offset at each iteration (movq).

When I'll get time, I eventually post some benchmark on that. Good question.

like image 64
YSC Avatar answered Sep 21 '22 20:09

YSC


There is actually insufficient information from your code snippet to be able to tell, and the one thing that I can think of is aliasing. From our point of view, it's pretty clear that you don't want p and bitmap to point to the same location in memory, but the compiler doesn't know that and (because p is of type char*) the compiler has to make this code work even if p and bitmap overlap.

This means in this case that if the loop changes bitmap->width through the pointer p then that has to be seen when re-reading bitmap->width later on, which in turn means that storing it in a local variable would be illegal.

That being said, I believe some compilers will actually sometimes generate two versions of the same code (I have seen circumstantial evidence of this, but never directly sought out information on what the compiler is doing in this case), and quickly check if the pointers alias and run the faster code if it determines it's okay to.

That being said, I stand by my comment about simply measuring the performance of the two versions, my money is on not seeing any consistent performance difference between the two versions of the code.

In my opinion, questions like these are okay if your purpose is to learn about compiler optimization theories and techniques, but is a waste of time (a useless micro-optimization) if your end goal here is to make the program run faster.

like image 21
SirGuy Avatar answered Sep 21 '22 20:09

SirGuy