Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using this pointer causes strange deoptimization in hot loop

I recently came across a strange deoptimization (or rather missed optimization opportunity).

Consider this function for efficient unpacking of arrays of 3-bit integers to 8-bit integers. It unpacks 16 ints in each loop iteration:

void unpack3bit(uint8_t* target, char* source, int size) {
   while(size > 0){
      uint64_t t = *reinterpret_cast<uint64_t*>(source);
      target[0] = t & 0x7;
      target[1] = (t >> 3) & 0x7;
      target[2] = (t >> 6) & 0x7;
      target[3] = (t >> 9) & 0x7;
      target[4] = (t >> 12) & 0x7;
      target[5] = (t >> 15) & 0x7;
      target[6] = (t >> 18) & 0x7;
      target[7] = (t >> 21) & 0x7;
      target[8] = (t >> 24) & 0x7;
      target[9] = (t >> 27) & 0x7;
      target[10] = (t >> 30) & 0x7;
      target[11] = (t >> 33) & 0x7;
      target[12] = (t >> 36) & 0x7;
      target[13] = (t >> 39) & 0x7;
      target[14] = (t >> 42) & 0x7;
      target[15] = (t >> 45) & 0x7;
      source+=6;
      size-=6;
      target+=16;
   }
}

Here is the generated assembly for parts of the code:

 ...
 367:   48 89 c1                mov    rcx,rax
 36a:   48 c1 e9 09             shr    rcx,0x9
 36e:   83 e1 07                and    ecx,0x7
 371:   48 89 4f 18             mov    QWORD PTR [rdi+0x18],rcx
 375:   48 89 c1                mov    rcx,rax
 378:   48 c1 e9 0c             shr    rcx,0xc
 37c:   83 e1 07                and    ecx,0x7
 37f:   48 89 4f 20             mov    QWORD PTR [rdi+0x20],rcx
 383:   48 89 c1                mov    rcx,rax
 386:   48 c1 e9 0f             shr    rcx,0xf
 38a:   83 e1 07                and    ecx,0x7
 38d:   48 89 4f 28             mov    QWORD PTR [rdi+0x28],rcx
 391:   48 89 c1                mov    rcx,rax
 394:   48 c1 e9 12             shr    rcx,0x12
 398:   83 e1 07                and    ecx,0x7
 39b:   48 89 4f 30             mov    QWORD PTR [rdi+0x30],rcx
 ...

It looks quite efficent. Simply a shift right followed by an and, and then a store to the target buffer. But now, look what happens when I change the function to a method in a struct:

struct T{
   uint8_t* target;
   char* source;
   void unpack3bit( int size);
};

void T::unpack3bit(int size) {
        while(size > 0){
           uint64_t t = *reinterpret_cast<uint64_t*>(source);
           target[0] = t & 0x7;
           target[1] = (t >> 3) & 0x7;
           target[2] = (t >> 6) & 0x7;
           target[3] = (t >> 9) & 0x7;
           target[4] = (t >> 12) & 0x7;
           target[5] = (t >> 15) & 0x7;
           target[6] = (t >> 18) & 0x7;
           target[7] = (t >> 21) & 0x7;
           target[8] = (t >> 24) & 0x7;
           target[9] = (t >> 27) & 0x7;
           target[10] = (t >> 30) & 0x7;
           target[11] = (t >> 33) & 0x7;
           target[12] = (t >> 36) & 0x7;
           target[13] = (t >> 39) & 0x7;
           target[14] = (t >> 42) & 0x7;
           target[15] = (t >> 45) & 0x7;
           source+=6;
           size-=6;
           target+=16;
        }
}

I thought the generated assembly should be quite the same, but it isn't. Here is a part of it:

...
 2b3:   48 c1 e9 15             shr    rcx,0x15
 2b7:   83 e1 07                and    ecx,0x7
 2ba:   88 4a 07                mov    BYTE PTR [rdx+0x7],cl
 2bd:   48 89 c1                mov    rcx,rax
 2c0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2c3:   48 c1 e9 18             shr    rcx,0x18
 2c7:   83 e1 07                and    ecx,0x7
 2ca:   88 4a 08                mov    BYTE PTR [rdx+0x8],cl
 2cd:   48 89 c1                mov    rcx,rax
 2d0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2d3:   48 c1 e9 1b             shr    rcx,0x1b
 2d7:   83 e1 07                and    ecx,0x7
 2da:   88 4a 09                mov    BYTE PTR [rdx+0x9],cl
 2dd:   48 89 c1                mov    rcx,rax
 2e0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2e3:   48 c1 e9 1e             shr    rcx,0x1e
 2e7:   83 e1 07                and    ecx,0x7
 2ea:   88 4a 0a                mov    BYTE PTR [rdx+0xa],cl
 2ed:   48 89 c1                mov    rcx,rax
 2f0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 ...

As you see, we introduced an additional redundant load from memory before each shift (mov rdx,QWORD PTR [rdi]). It seems like the target pointer (which is now a member instead of a local variable) has to be always reloaded before storing into it. This slows down the code considerably (around 15% in my measurements).

First I thought maybe the C++ memory model enforces that a member pointer may not be stored in a register but has to be reloaded, but this seemed like an awkward choice, as it would make a lot of viable optimizations impossible. So I was very surprised that the compiler did not store target in a register here.

I tried caching the member pointer myself into a local variable:

void T::unpack3bit(int size) {
    while(size > 0){
       uint64_t t = *reinterpret_cast<uint64_t*>(source);
       uint8_t* target = this->target; // << ptr cached in local variable
       target[0] = t & 0x7;
       target[1] = (t >> 3) & 0x7;
       target[2] = (t >> 6) & 0x7;
       target[3] = (t >> 9) & 0x7;
       target[4] = (t >> 12) & 0x7;
       target[5] = (t >> 15) & 0x7;
       target[6] = (t >> 18) & 0x7;
       target[7] = (t >> 21) & 0x7;
       target[8] = (t >> 24) & 0x7;
       target[9] = (t >> 27) & 0x7;
       target[10] = (t >> 30) & 0x7;
       target[11] = (t >> 33) & 0x7;
       target[12] = (t >> 36) & 0x7;
       target[13] = (t >> 39) & 0x7;
       target[14] = (t >> 42) & 0x7;
       target[15] = (t >> 45) & 0x7;
       source+=6;
       size-=6;
       this->target+=16;
    }
}

This code also yields the "good" assembler without additional stores. So my guess is: The compiler is not allowed to hoist the load of a member pointer of a struct, so such a "hot pointer" should always be stored in a local variable.

  • So, why is the compiler unable to optimize out these loads?
  • Is it the C++ memory model that forbids this? Or is it simply a shortcoming of my compiler?
  • Is my guess correct or what is the exact reason why the optimization can't be performed?

The compiler in use was g++ 4.8.2-19ubuntu1 with -O3 optimization. I also tried clang++ 3.4-1ubuntu3 with similar results: Clang is even able to vectorize the method with the local target pointer. However, using the this->target pointer yields the same result: An extra load of the pointer before each store.

I checked the assembler of some similar methods and the result is the same: It seems that a member of this always has to be reloaded before a store, even if such a load could simply be hoisted outside the loop. I will have to rewrite a lot of code to get rid of these additional stores, mainly by caching the pointer myself into a local variable that is declared above the hot code. But I always thought fiddling with such details as caching a pointer in a local variable would surely qualify for premature optimization in these days where compilers have gotten so clever. But it seems I am wrong here. Caching a member pointer in a hot loop seems to be a necessary manual optimization technique.

like image 439
gexicide Avatar asked Oct 10 '14 08:10

gexicide


3 Answers

Pointer aliasing seems to be the problem, ironically between this and this->target. The compiler is taking into account the rather obscene possibility that you initialized:

this->target = &this

In that case, writing to this->target[0] would alter the contents of this (and thus, this->target).

The memory aliasing problem is not restricted to the above. In principle, any use of this->target[XX] given an (in)appropriate value of XX might point to this.

I am better versed in C, where this can be remedied by declaring pointer variables with the __restrict__ keyword.

like image 192
Peter Boncz Avatar answered Nov 12 '22 23:11

Peter Boncz


Strict aliasing rules allows char* to alias any other pointer. So this->target may alias with this, and in your code method, the first part of the code,

target[0] = t & 0x7;
target[1] = (t >> 3) & 0x7;
target[2] = (t >> 6) & 0x7;

is in fact

this->target[0] = t & 0x7;
this->target[1] = (t >> 3) & 0x7;
this->target[2] = (t >> 6) & 0x7;

as this may be modified when you modify this->target content.

Once this->target is cached into a local variable, the alias is no longer possible with the local variable.

like image 36
Jarod42 Avatar answered Nov 13 '22 00:11

Jarod42


The issue here is strict aliasing which says that we are allowed to alias through a char* and so that prevents compiler optimization in your case. We are not allowed to alias through a pointer of a different type which would be undefined behavior, normally on SO we see this problem which is users attempting to alias through incompatible pointer types.

It would seem reasonable to implement uint8_t as a unsigned char and if we look at cstdint on Coliru it includes stdint.h which typedefs uint8_t as follows:

typedef unsigned char       uint8_t;

if you used another non-char type then the compiler should be able to optimize.

This is covered in the draft C++ standard section 3.10 Lvalues and rvalues which says:

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined

and includes the following bullet:

  • a char or unsigned char type.

Note, I posted a comment on possible work arounds in a question that asks When is uint8_t ≠ unsigned char? and the recommendation was:

The trivial workaround, however, is to use the restrict keyword, or to copy the pointer to a local variable whose address is never taken so that the compiler does not need to worry about whether the uint8_t objects can alias it.

Since C++ does not support the restrict keyword you have to rely on compiler extension, for example gcc uses __restrict__ so this is not totally portable but the other suggestion should be.

like image 26
Shafik Yaghmour Avatar answered Nov 13 '22 00:11

Shafik Yaghmour