
Efficiency on a 64-bit platform: pointer vs 32-bit array indexing

In one of his keynotes, Andrei Alexandrescu suggests that, on a 64-bit platform, using 32-bit array indexing is faster than using a raw pointer:

Page 16: http://www.slideshare.net/andreialexandrescu1/three-optimization-tips-for-c-15708507

On his Facebook account he is more precise and says: "Prefer array indexing to pointers (this one seems to reverse every ten years)."

I have tried many things to find a difference, but I have not managed to build any program that shows it. Knowing Andrei, I would not be surprised if the difference were no more than a few percent, but I would be glad if anyone could find such an example.

Here is a test I have done. I chose n = 5000, large enough to get a decent timing and small enough that everything fits in the L1 cache. I loop a few times so that the CPU frequency has time to ramp up.

#include <iostream>
#include <chrono>

int main(int argc, const char* argv[]) {
  const int n{5000};
  int* p{new int[n]};

  // Warm up the cache
  for (int i{0}; i < n; i++) {
    p[i] += 1;
  }

  for (int j{0}; j < 5; j++) {
    {
      auto start_pointer = std::chrono::high_resolution_clock::now();
      for (int* q{p}; q != p + n; ++q) {
        ++(*q);
      }
      auto end_pointer = std::chrono::high_resolution_clock::now();
      auto time_pointer = std::chrono::duration_cast<std::chrono::nanoseconds>(
                              end_pointer - start_pointer)
                              .count();
      std::cout << " Pointer: " << time_pointer << std::endl;
    }

    {
      auto start_pointer = std::chrono::high_resolution_clock::now();
      for (int* q{p}; q != p + n; ++q) {
        ++(*q);
      }
      auto end_pointer = std::chrono::high_resolution_clock::now();
      auto time_pointer = std::chrono::duration_cast<std::chrono::nanoseconds>(
                              end_pointer - start_pointer)
                              .count();
      std::cout << " Pointer: " << time_pointer << std::endl;
    }

    {
      auto start_index_32 = std::chrono::high_resolution_clock::now();
      for (int i{0}; i < n; i++) {
        p[i] += 1;
      }
      auto end_index_32 = std::chrono::high_resolution_clock::now();
      auto time_index_32 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                               end_index_32 - start_index_32)
                               .count();
      std::cout << "Index 32: " << time_index_32 << std::endl;
    }

    {
      auto start_index_32 = std::chrono::high_resolution_clock::now();
      for (int i{0}; i < n; i++) {
        p[i] += 1;
      }
      auto end_index_32 = std::chrono::high_resolution_clock::now();
      auto time_index_32 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                               end_index_32 - start_index_32)
                               .count();
      std::cout << "Index 32: " << time_index_32 << std::endl;
    }

    {
      const std::size_t n_64{n};
      auto start_index_64 = std::chrono::high_resolution_clock::now();
      for (std::size_t i{0}; i < n_64; i++) {
        p[i] += 1;
      }
      auto end_index_64 = std::chrono::high_resolution_clock::now();
      auto time_index_64 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                               end_index_64 - start_index_64)
                               .count();
      std::cout << "Index 64: " << time_index_64 << std::endl;
    }

    {
      const std::size_t n_64{n};
      auto start_index_64 = std::chrono::high_resolution_clock::now();
      for (std::size_t i{0}; i < n_64; i++) {
        p[i] += 1;
      }
      auto end_index_64 = std::chrono::high_resolution_clock::now();
      auto time_index_64 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                               end_index_64 - start_index_64)
                               .count();
      std::cout << "Index 64: " << time_index_64 << std::endl;
    }
    std::cout << std::endl;
  }

  delete[] p;

  return 0;
}

Here is the kind of result I get on my machine (Core i7). All I get is "noise".

 Pointer: 883
 Pointer: 485
Index 32: 436
Index 32: 380
Index 64: 372
Index 64: 429

 Pointer: 330
 Pointer: 316
Index 32: 336
Index 32: 321
Index 64: 337
Index 64: 318

 Pointer: 311
 Pointer: 314
Index 32: 318
Index 32: 319
Index 64: 316
Index 64: 301

 Pointer: 306
 Pointer: 325
Index 32: 323
Index 32: 313
Index 64: 318
Index 64: 305

 Pointer: 311
 Pointer: 319
Index 32: 313
Index 32: 324
Index 64: 315
Index 64: 303
asked Dec 17 '15 by InsideLoop




3 Answers

The problem with low-level advice like this (even coming from Andrei Alexandrescu) is that it ignores the fact that compilers optimize.

Modern compilers optimize so aggressively (and, in general, successfully) that it really becomes a mug's game to try to second-guess them. On the whole, writing clear, readable code will help you, your colleagues and your compilers analyze the code. And I honestly believe that is the best general advice that can be given.

One of the well-known optimizations which modern compilers use is the conversion between index- and pointer-based loops. In the particular case of your benchmark, with most optimization settings, gcc will compile both the pointer-based and the 32-bit-index-based loop to the same assembler output.

In the following, I replaced the chrono stuff with ++sentry, where sentry is a volatile int, in order to reduce the code size. The assembly output corresponds to:

for (int* q{p}; q != p + n; ++q) ++(*q);
++sentry;
for (int i{0}; i < n; i++) p[i] += 1;

Compiled with -O2, this produces the following (%rdi and %ebp were still initialized from the loop which populated p):

        movq    %rdi, %rdx
        cmpq    %rcx, %rdi
        je      .L10
.L16:
        addl    $1, (%rdx)
        addq    $4, %rdx
        cmpq    %rcx, %rdx
        jne     .L16
.L10:
        movl    sentry(%rip), %eax
        movq    %rdi, %rdx
        addl    $1, %eax
        movl    %eax, sentry(%rip)
        testl   %ebp, %ebp
        jle     .L8
.L14:
        addl    $1, (%rdx)
        addq    $4, %rdx
        cmpq    %rdx, %rsi
        jne     .L14
.L8:

You can see that there is no difference at all between the loops at .L16 and .L14.

Different optimization settings produce different results, of course. With -O3 the loops are vectorized using SIMD instructions and Duff's device, but again are almost identical. clang does this optimization at -O2.

None of that denies the point being made, which is that the compiler may need to work harder to prove that a pointer which is being written through cannot modify arbitrary memory.

But in this case, as in many cases, the loop index is a local variable and the loop is simple enough that the compiler can fully analyze it, thus allowing strength reduction, unrolling, and vectorization; whether the control variable is a pointer or an index is then irrelevant.
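To make that concrete, here is a minimal sketch (my own example, not taken from the answer) of a loop the compiler cannot fully analyze, because the trip count lives in a global that the writes through the pointer might alias:

int limit = 100;  // global; a plain int* could legally point at it

void scale(int* out) {
  // The compiler must allow for out[i] aliasing `limit`, so it cannot simply
  // keep `limit` in a register across the loop unless it proves otherwise
  // (or emits a runtime alias check / versioned loop).
  for (int i = 0; i < limit; ++i)
    out[i] *= 2;
}

Here the obstacle is not whether the loop uses an index or a pointer, but whether the compiler can prove what the writes may touch.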


A more interesting example (possibly) is a loop over two arrays where the base elements are different sizes. Given the following two functions:

void d2f_ptr(float* out, const double* in, int n) {
  for (auto lim = out + n; out < lim;) *out++ = *in++;
}
void d2f_idx(float out[], const double in[], int n) {
  for (int i = 0; i < n; ++i) out[i] = in[i];
}

gcc (v5.3.0, -O2) does produce different loops and the index-based loop is one instruction shorter:

d2f_ptr(float*, double const*, int):    d2f_idx(float*, double const*, int):
        movslq  %edx, %rdx                      xorl    %eax, %eax
        leaq    (%rdi,%rdx,4), %rax             testl   %edx, %edx
        cmpq    %rax, %rdi                      jle     .L16
        jnb     .L11
.L15:                                   .L20:
        addq    $4, %rdi                        pxor    %xmm0, %xmm0
        addq    $8, %rsi                        cvtsd2ss (%rsi,%rax,8), %xmm0
        pxor    %xmm0, %xmm0                    movss   %xmm0, (%rdi,%rax,4)
        cvtsd2ss -8(%rsi), %xmm0                addq    $1, %rax
        movss   %xmm0, -4(%rdi)
        cmpq    %rdi, %rax                      cmpl    %eax, %edx
        ja      .L15                            jg      .L20
.L11:                                   .L16:
        ret                                     ret

But change the double and float to objects whose sizes no longer permit the use of the Intel chip's indexed addressing mode, and the compiler once again converts the index-based code to a pointer-based variant.

Here the code is essentially the same as before, but the double has been padded to 48 bytes:

struct Big { double val; char padding[40]; };
struct Small {
  float val;
  Small& operator=(const Big& other) {
    val = other.val;
    return *this;
  }
};

d2f_ptr(Small*, Big const*, int):       d2f_idx(Small*, Big const*, int):
        movslq  %edx, %rdx                      testl   %edx, %edx
        leaq    (%rdi,%rdx,4), %rax             jle     .L26
        cmpq    %rax, %rdi                      leal    -1(%rdx), %eax
        jnb     .L21                            leaq    4(%rdi,%rax,4), %rax
.L25:                                   .L29:
        addq    $48, %rsi                       pxor    %xmm0, %xmm0
        addq    $4, %rdi                        addq    $4, %rdi
        pxor    %xmm0, %xmm0                    cvtsd2ss (%rsi), %xmm0
        cvtsd2ss -48(%rsi), %xmm0               addq    $48, %rsi
        movss   %xmm0, -4(%rdi)                 movss   %xmm0, -4(%rdi)
        cmpq    %rdi, %rax                      cmpq    %rax, %rdi
        ja      .L25                            jne     .L29
.L21:                                   .L26:
        ret                                     ret

It's possibly worth adding that for compilers, it is not necessarily more difficult to analyze which object a particular pointer write will modify. [Edited: There was a quote from Alexandrescu here, but it wasn't as relevant as I thought, so I removed it leaving this section to be mostly a strawman.]

In fact, if a pointer is only directly assigned to once, and all other modifications are through increment and decrement operations (including += and -=), then the compiler is totally within its rights to assume that the pointer always points within the same object. If some additive modification of the pointer were to overshoot into some other object, that would be Undefined Behaviour and the compiler can discard that possibility. It's easy enough to track assign and inc/dec operations in a flow graph, so in cases where the pointer could have been replaced with an index expression, it is quite possible for a compiler to figure that out and thus know that other objects are not being randomly mutated by writes through the pointer.
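A hypothetical illustration of that reasoning (mine, not the answer's):

int total = 0;

void bump(int n) {           // caller is assumed to pass n <= 16
  int buf[16] = {};
  // `q` is assigned once from the local array `buf` and only ever
  // incremented, so a write through it that landed outside `buf` would be
  // undefined behaviour. The compiler may therefore assume the loop never
  // touches `total` and keep `total` in a register throughout.
  for (int* q = buf; q != buf + n; ++q) {
    ++*q;
    ++total;
  }
}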

answered by rici


His (Andrei Alexandrescu's) reasoning seems to be based on the fact that keeping a pointer variable in a register is generally harder for the compiler, since a pointer could be pointing to global data. But I don't see anything specific to 32-bit array indexing (to my reading, the slide wasn't quite clear whether he was referring to 32-bit arrays or to array indexing on 32-bit systems).

From the horse's mouth: (yes, it's a link to his Facebook account :)

Minimize array writes

To be faster, code should reduce the number of array writes, and more generally, writes through pointers.

On modern machines with large register files and ample register renaming hardware, you can assume that most named individual variables (numbers, pointers) end up sitting in registers. Operating with registers is fast and plays into the strengths of the hardware setup. Even when data dependencies (a major enemy of instruction-level parallelism) come into play, CPUs have special hardware dedicated to managing various dependency patterns. Operating with registers (i.e. named variables) is betting on the house. Do it.

In contrast, array operations (and general indirect accesses) are less natural across the entire compiler-processor-cache hierarchy. Save for a few obvious patterns, array accesses are not registered. Also, whenever pointers are involved, the compiler must assume the pointers could point to global data, meaning any function call may change pointed-to data arbitrarily. And of array operations, array writes are the worst of the pack. Given that all traffic with memory is done at cache-line granularity, writing one word to memory is essentially a cache line read followed by a cache line write. So given that to a good extent array reads are inevitable anyway, this piece of advice boils down to "avoid array writes wherever possible."
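A minimal sketch of what that advice can look like in practice (the function names and code are mine, not Andrei's):

int sum_via_pointer(const int* in, int n, int* out) {
  // The compiler must assume `out` may alias `in`, so *out is written (and
  // the source data possibly reloaded) on every iteration.
  *out = 0;
  for (int i = 0; i < n; ++i)
    *out += in[i];
  return *out;
}

int sum_via_register(const int* in, int n, int* out) {
  int acc = 0;               // a named local that can live in a register
  for (int i = 0; i < n; ++i)
    acc += in[i];
  *out = acc;                // memory is written exactly once
  return acc;
}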

He also seems to present this as a general suggestion rather than a claim that array indexing is always faster (from the same post):

A few good, but less known, things to do for fast code:

Prefer static linking and position-dependent code (as opposed to PIC, position-independent code).
Prefer 64-bit code and 32-bit data.
Prefer array indexing to pointers (this one seems to reverse every ten years).
Prefer regular memory access patterns. Minimize control flow.
Avoid data dependencies.

answered by P.P


I wrote an email to Andrei Alexandrescu and he was kind enough to reply. Here is his explanation:

"In order for the speedup to be visible, you need to exploit the ALU's capability of running either two 32-bit operations or one 64-bit operation in one cycle. Not every benchmark will show a speedup."

I understand it as "SIMD instructions process more data per cycle with 32-bit data than with 64-bit data". I have yet to find a benchmark (that doesn't contain any array of integers) where it makes a difference, and I assume it's going to be difficult. Andrei used to work for Facebook, where every single percentage point was worth getting.
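A hedged illustration of that reading (my example, not Andrei's): with AVX2, a 256-bit vector register holds eight 32-bit integers but only four 64-bit integers, so an auto-vectorized 32-bit loop can retire roughly twice as many additions per vector instruction as its 64-bit counterpart.

#include <cstddef>
#include <cstdint>

std::int32_t sum32(const std::int32_t* a, std::size_t n) {
  std::int32_t s = 0;
  for (std::size_t i = 0; i < n; ++i) s += a[i];  // 8 lanes per 256-bit add
  return s;
}

std::int64_t sum64(const std::int64_t* a, std::size_t n) {
  std::int64_t s = 0;
  for (std::size_t i = 0; i < n; ++i) s += a[i];  // 4 lanes per 256-bit add
  return s;
}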

answered by InsideLoop