 

Micro optimize pointer + unsigned + 1

Hard as it may be to believe, the construct p[u+1] occurs in several places in the innermost loops of code I maintain, such that getting the micro-optimization of it right makes hours of difference in an operation that runs for days.

Typically *((p+u)+1) is most efficient. Sometimes *(p+(u+1)) is most efficient. Rarely, *((p+1)+u) is best. (But usually an optimizer can convert *((p+1)+u) to *((p+u)+1) when the latter is better, while it can't interconvert *(p+(u+1)) with either of the others.)

p is a pointer and u is an unsigned int. In the actual code, at least one of them (more likely both) will already be in a register at the point the expression is evaluated. Those facts are critical to the point of my question.

In 32-bit builds (before my project dropped support for them), all three have exactly the same semantics, any half-decent compiler simply picks the best of the three, and the programmer never needs to care.

In these 64-bit uses, the programmer knows all three have the same semantics, but the compiler doesn't know. So far as the compiler knows, the decision of when to extend u from 32-bit to 64-bit can affect the result.

What is the cleanest way to tell the compiler that the semantics of all three are the same and the compiler should select the fastest of them?

With one 64-bit Linux compiler, I got nearly there with p[u+1L], which causes the compiler to select intelligently between the usually best *((p+u)+1) and the sometimes better *(p+((long)(u)+1)). In the rare case where *(p+(u+1)) would still beat the second of those, a little is lost.

Obviously, that does no good on 64-bit Windows, where long is still only 32 bits. Now that we have dropped 32-bit support, maybe p[u+1LL] is portable enough and good enough. But can I do better?

Note that using std::size_t instead of unsigned for u would eliminate this entire problem, but would create a larger performance problem nearby. Casting u to std::size_t right at the point of use is almost good enough, and maybe the best I can do, but it is pretty verbose for an imperfect solution.

Simply coding (p+1)[u] makes the compiler's selection more likely to be optimal than with p[u+1]. If the code were less templated and more stable, I could change them all to (p+1)[u], profile, and then switch a few back to p[u+1]. But the templating tends to defeat that approach (a single source line appears in many places in the profile, adding up to serious time overall, but not serious time at any individual call site).
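
For reference, here is a minimal sketch of the variants under discussion (the names p and u are from the question; the element type, backing storage, and sizes are assumptions purely for illustration):

#include <cstddef>
#include <cstdint>

int buffer[1024];   // backing storage, assumed for illustration
int* p = buffer;    // "p is a pointer", as in the question

int variants(std::uint32_t u) {
    int a = p[u + 1];                            // original: the compiler decides when to widen u
    int b = p[u + 1LL];                          // force 64-bit index arithmetic (OK once 32-bit support is dropped)
    int c = p[static_cast<std::size_t>(u) + 1];  // widen u explicitly; verbose but unambiguous
    int d = (p + 1)[u];                          // fold the +1 into the pointer instead
    return a + b + c + d;
}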

The compilers that need to handle this efficiently are GCC, ICC, and MSVC.

asked Dec 29 '15 by JSF


1 Answer

The answer is inevitably compiler- and target-specific, but even if 1ULL is wider than a pointer on whatever the target architecture is, a good compiler should optimize it away. The linked Q&A "Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?" explains why a wider computation truncated to pointer width gives results identical to doing the computation at pointer width in the first place. This is why compilers can optimize the widening away even on 32bit machines (or x86-64 with the x32 ABI) when 1ULL leads to promotion of the + operands to a 64bit type. (Or on some 64bit ABI for some architecture where long long is 128b.)
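
As a toy illustration of that truncation argument (my own sketch, not taken from the linked answer): doing the addition in a wider type and then truncating to 32 bits produces the same low 32 bits as doing the addition at 32 bits to begin with, even when the add carries out of bit 31.

#include <cassert>
#include <cstdint>

int main() {
    std::uint32_t u = 0xFFFFFFFFu;                      // worst case: the +1 carries out of 32 bits
    std::uint32_t narrow = u + 1u;                      // 32-bit add: wraps to 0
    std::uint32_t truncated = static_cast<std::uint32_t>(
        static_cast<std::uint64_t>(u) + 1ULL);          // 64-bit add gives 0x100000000; truncated, it is also 0
    assert(narrow == truncated);                        // low 32 bits agree, so a narrow machine can skip the widening
}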


1ULL looks optimal for 64bit, and for 32bit with clang. You don't care about 32bit anyway, but gcc wastes an instruction in the return p[u + 1ULL]; case. All the other cases compile to a single load with a scaled-index+4+p addressing mode. So other than one compiler's optimization failure, 1ULL looks fine for 32bit as well. (I think it's unlikely that clang is making an illegal optimization here, i.e. that this is a clang bug.)
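
(The v1ULL snippets below assume #include <cstdint> and a file-scope array along the lines of int p[1024]; neither declaration is shown in the original answer.)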

int v1ULL(std::uint32_t u) { return p[u + 1ULL]; }
//   ...  load u from the stack
//    add     eax, 1
//    mov     eax, DWORD PTR p[0+eax*4]

instead of

    mov     eax, DWORD PTR p[4+eax*4]

Interestingly, gcc 5.3 doesn't make this mistake when targeting the x32 ABI (long mode with 32bit pointers and a register-call ABI similar to SysV AMD64). It uses a 32bit address-size prefix to avoid using the upper 32b of rdi.

Annoyingly, it still uses an address-size prefix when it could save a byte of machine code by using a 64bit effective address (when there's no chance of overflow/carry into the upper32 generating an address outside the low 4GiB). Passing the pointer by reference is a good example:

int x2   (char *&c) { return *c; }
//    mov     eax, DWORD PTR [edi]  ; upper32 of rax is zero
//    movsx   eax, BYTE PTR [eax]   ; could be byte [rax], saving one byte of machine code

Err, actually I forget. 32bit addresses might sign-extend to 64b, not zero-extend. If that's the case, it could have used movsx for the first instruction, too, but that would have cost a byte because movsx has a longer opcode than mov.

Anyway, x32 is still an interesting choice for pointer-heavy code that wants more registers and a nicer ABI, without the cache-miss hit of 8B pointers.


The 64bit asm has to zero the upper32 of the register holding the parameter (with mov edi,edi), but that goes away when inlining. Looking at godbolt output for tiny functions is a valid way to test this.
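
For comparison, the by-value version compiled for 64-bit looks something like this (a sketch reconstructed from the description above, in the style of the other listings; exact gcc output may differ):

int v1ULL(std::uint32_t u) { return p[u + 1ULL]; }
//  mov     edi, edi                      ; zero the upper32 of rdi (disappears after inlining)
//  mov     eax, DWORD PTR p[4+rdi*4]     ; single load with scaled-index addressing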

If we want to make doubly sure that the compiler isn't shooting itself in the foot and zeroing the upper32 when it should know it's already zero, we could make test functions with an arg passed by reference.

int v1ULL(const std::uint32_t &u) { return p[u + 1ULL]; }
//  mov     eax, DWORD PTR [rdi]
//  mov     eax, DWORD PTR p[4+rax*4]

answered Oct 28 '22 by Peter Cordes