In one of his keynotes, Andrei Alexandrescu suggests that, on a 64-bit platform, using 32-bit array indexing is faster than using a raw pointer:
Page 16: http://www.slideshare.net/andreialexandrescu1/three-optimization-tips-for-c-15708507
On his Facebook account, he is more precise and says: "Prefer array indexing to pointers (this one seems to reverse every ten years)."
I have tried many things to find a difference, but I have not managed to build any program that shows it. Knowing Andrei, I would not be surprised if the difference were no more than a few percent, but I would be glad if anyone could find such an example.
Here is a test I have done. I chose n = 5000, large enough to get a decent timing and small enough so that everything fits in the L1 cache. I loop a few times so that the CPU frequency ramps up.
#include <iostream>
#include <chrono>
#include <cstddef>  // std::size_t

int main(int argc, const char* argv[]) {
    const int n{5000};
    int* p{new int[n]};

    // Warm up the cache
    for (int i{0}; i < n; i++) {
        p[i] += 1;
    }

    for (int j{0}; j < 5; j++) {
        {
            auto start_pointer = std::chrono::high_resolution_clock::now();
            for (int* q{p}; q != p + n; ++q) {
                ++(*q);
            }
            auto end_pointer = std::chrono::high_resolution_clock::now();
            auto time_pointer = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                    end_pointer - start_pointer)
                                    .count();
            std::cout << " Pointer: " << time_pointer << std::endl;
        }
        {
            auto start_pointer = std::chrono::high_resolution_clock::now();
            for (int* q{p}; q != p + n; ++q) {
                ++(*q);
            }
            auto end_pointer = std::chrono::high_resolution_clock::now();
            auto time_pointer = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                    end_pointer - start_pointer)
                                    .count();
            std::cout << " Pointer: " << time_pointer << std::endl;
        }
        {
            auto start_index_32 = std::chrono::high_resolution_clock::now();
            for (int i{0}; i < n; i++) {
                p[i] += 1;
            }
            auto end_index_32 = std::chrono::high_resolution_clock::now();
            auto time_index_32 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                     end_index_32 - start_index_32)
                                     .count();
            std::cout << "Index 32: " << time_index_32 << std::endl;
        }
        {
            auto start_index_32 = std::chrono::high_resolution_clock::now();
            for (int i{0}; i < n; i++) {
                p[i] += 1;
            }
            auto end_index_32 = std::chrono::high_resolution_clock::now();
            auto time_index_32 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                     end_index_32 - start_index_32)
                                     .count();
            std::cout << "Index 32: " << time_index_32 << std::endl;
        }
        {
            const std::size_t n_64{n};
            auto start_index_64 = std::chrono::high_resolution_clock::now();
            for (std::size_t i{0}; i < n_64; i++) {
                p[i] += 1;
            }
            auto end_index_64 = std::chrono::high_resolution_clock::now();
            auto time_index_64 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                     end_index_64 - start_index_64)
                                     .count();
            std::cout << "Index 64: " << time_index_64 << std::endl;
        }
        {
            const std::size_t n_64{n};
            auto start_index_64 = std::chrono::high_resolution_clock::now();
            for (std::size_t i{0}; i < n_64; i++) {
                p[i] += 1;
            }
            auto end_index_64 = std::chrono::high_resolution_clock::now();
            auto time_index_64 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                     end_index_64 - start_index_64)
                                     .count();
            std::cout << "Index 64: " << time_index_64 << std::endl;
        }
        std::cout << std::endl;
    }

    delete[] p;
    return 0;
}
Here is the kind of result I get on my machine (Core i7). All I get is "noise".
Pointer: 883
Pointer: 485
Index 32: 436
Index 32: 380
Index 64: 372
Index 64: 429
Pointer: 330
Pointer: 316
Index 32: 336
Index 32: 321
Index 64: 337
Index 64: 318
Pointer: 311
Pointer: 314
Index 32: 318
Index 32: 319
Index 64: 316
Index 64: 301
Pointer: 306
Pointer: 325
Index 32: 323
Index 32: 313
Index 64: 318
Index 64: 305
Pointer: 311
Pointer: 319
Index 32: 313
Index 32: 324
Index 64: 315
Index 64: 303
Pointer arithmetic is slightly faster (about 10%) than array indexing. The size of a pointer is 8 bytes. Note: this code was executed on a 64-bit processor; in 64-bit data models, pointer sizes are always 64 bits.
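A quick check of the pointer width (a trivial sketch; the exact sizes assume an LP64 platform such as 64-bit Linux or macOS):

#include <iostream>

int main() {
    // On an LP64 platform, every object pointer is 8 bytes wide,
    // regardless of the pointee type.
    std::cout << sizeof(char*) << ' ' << sizeof(int*) << '\n';  // prints "8 8"
    return 0;
}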
The problem with low-level advice like this (even coming from Andrei Alexandrescu) is that it ignores the fact that compilers optimize.
Modern compilers optimize so aggressively (and, in general, successfully) that it really becomes a mug's game to try to second-guess them. On the whole, writing clear, readable code will help you, your colleagues and your compilers analyze the code. And I honestly believe that is the best general advice that can be given.
One of the well-known optimizations which modern compilers use is the conversion between index- and pointer-based loops. In the particular case of your benchmark, with most optimization settings, gcc will compile both the pointer-based and the 32-bit-index-based loop to the same assembler output.
In the following, I replaced the chrono stuff with ++sentry, where sentry is a volatile, in order to reduce the code size. The assembly output corresponds to:
for (int* q{p}; q != p + n; ++q) ++(*q);
++sentry;
for (int i{0}; i < n; i++) p[i] += 1;
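For reference, a minimal compilable harness around those three lines might look like this (my reconstruction: p, n and sentry come from the answer; the enclosing function and its name are assumed):

volatile int sentry;  // the volatile write separates the two loops in the output

void increment_both_ways(int* p, int n) {
    for (int* q{p}; q != p + n; ++q) ++(*q);
    ++sentry;
    for (int i{0}; i < n; i++) p[i] += 1;
}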
Compiled with -O2, this produced the following (%rdi and %ebp were still initialized from the loop which populated p):
        movq    %rdi, %rdx
        cmpq    %rcx, %rdi
        je      .L10
.L16:
        addl    $1, (%rdx)
        addq    $4, %rdx
        cmpq    %rcx, %rdx
        jne     .L16
.L10:
        movl    sentry(%rip), %eax
        movq    %rdi, %rdx
        addl    $1, %eax
        movl    %eax, sentry(%rip)
        testl   %ebp, %ebp
        jle     .L8
.L14:
        addl    $1, (%rdx)
        addq    $4, %rdx
        cmpq    %rdx, %rsi
        jne     .L14
.L8:
You can see that there is no difference at all between the loops at .L16 and .L14.
Different optimization settings produce different results, of course. With -O3 the loops are vectorized using SIMD instructions and Duff's device, but again they are almost identical. clang does this optimization at -O2.
None of that denies the point being made, which is that the compiler may need to work harder to prove that a pointer which is being written through cannot modify arbitrary memory.
But in this case, as in many cases, the loop index is a local variable and the loop is simple enough that the compiler can fully analyze it, thus allowing strength reduction, unrolling, and vectorization; whether the control variable is a pointer or an index is then irrelevant.
A more interesting example (possibly) is a loop over two arrays where the base elements are different sizes. Given the following two functions:
void d2f_ptr(float* out, const double* in, int n) {
    for (auto lim = out + n; out < lim;) *out++ = *in++;
}

void d2f_idx(float out[], const double in[], int n) {
    for (int i = 0; i < n; ++i) out[i] = in[i];
}
gcc (v5.3.0, -O2) does produce different loops and the index-based loop is one instruction shorter:
d2f_ptr(float*, double const*, int):
        movslq  %edx, %rdx
        leaq    (%rdi,%rdx,4), %rax
        cmpq    %rax, %rdi
        jnb     .L11
.L15:
        addq    $4, %rdi
        addq    $8, %rsi
        pxor    %xmm0, %xmm0
        cvtsd2ss -8(%rsi), %xmm0
        movss   %xmm0, -4(%rdi)
        cmpq    %rdi, %rax
        ja      .L15
.L11:
        ret

d2f_idx(float*, double const*, int):
        xorl    %eax, %eax
        testl   %edx, %edx
        jle     .L16
.L20:
        pxor    %xmm0, %xmm0
        cvtsd2ss (%rsi,%rax,8), %xmm0
        movss   %xmm0, (%rdi,%rax,4)
        addq    $1, %rax
        cmpl    %eax, %edx
        jg      .L20
.L16:
        ret
But change the double and float to objects whose sizes no longer permit the use of the Intel chip's indexed addressing mode, and the compiler once again converts the index-based code to a pointer-based variant.
Here the code is essentially the same as before, but the double has been padded to 48 bytes:
struct Big {
    double val;
    char padding[40];
};

struct Small {
    float val;
    Small& operator=(const Big& other) {
        val = other.val;
        return *this;
    }
};
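The loops themselves are presumably unchanged except for the element types, i.e. something like the following (my reconstruction from the signatures in the assembly below):

void d2f_ptr(Small* out, const Big* in, int n) {
    for (auto lim = out + n; out < lim;) *out++ = *in++;
}

void d2f_idx(Small out[], const Big in[], int n) {
    for (int i = 0; i < n; ++i) out[i] = in[i];
}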
d2f_ptr(Small*, Big const*, int):
        movslq  %edx, %rdx
        leaq    (%rdi,%rdx,4), %rax
        cmpq    %rax, %rdi
        jnb     .L21
.L25:
        addq    $48, %rsi
        addq    $4, %rdi
        pxor    %xmm0, %xmm0
        cvtsd2ss -48(%rsi), %xmm0
        movss   %xmm0, -4(%rdi)
        cmpq    %rdi, %rax
        ja      .L25
.L21:
        ret

d2f_idx(Small*, Big const*, int):
        testl   %edx, %edx
        jle     .L26
        leal    -1(%rdx), %eax
        leaq    4(%rdi,%rax,4), %rax
.L29:
        pxor    %xmm0, %xmm0
        addq    $4, %rdi
        cvtsd2ss (%rsi), %xmm0
        addq    $48, %rsi
        movss   %xmm0, -4(%rdi)
        cmpq    %rax, %rdi
        jne     .L29
.L26:
        ret
It's possibly worth adding that it is not necessarily more difficult for compilers to analyze which object a particular pointer write will modify. [Edited: There was a quote from Alexandrescu here, but it wasn't as relevant as I thought, so I removed it, leaving this section to be mostly a strawman.]
In fact, if a pointer is only directly assigned to once, and all other modifications are through increment and decrement operations (including += and -=), then the compiler is totally within its rights to assume that the pointer always points within the same object. If some additive modification of the pointer were to overshoot into some other object, that would be Undefined Behaviour and the compiler can discard that possibility. It's easy enough to track assign and inc/dec operations in a flow graph, so in cases where the pointer could have been replaced with an index expression, it is quite possible for a compiler to figure that out and thus know that other objects are not being randomly mutated by writes through the pointer.
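As a small illustration of the aliasing problem (my example, not from the original text): in the function below the compiler must assume that p may point at the global g, so it has to reload g from memory after the store. Once it can prove that p only ever walks a known local array, that pessimism disappears.

int g;

int f(int* p) {
    g = 1;
    *p = 2;    // p might point at g...
    return g;  // ...so g must be reloaded from memory here
}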
His (Andrei Alexandrescu's) reasoning seems to be based on the fact that using a register for a pointer variable is generally harder for the compiler, since a pointer could be pointing to global data. But I don't see anything specific to 32-bit array indexing. (To my reading, the slide wasn't quite clear whether he was referring to 32-bit arrays or to array indexing on 32-bit systems.)
From the horse's mouth: (yes, it's a link to his Facebook account :)
Minimize array writes
To be faster, code should reduce the number of array writes, and more generally, writes through pointers.
On modern machines with large register files and ample register renaming hardware, you can assume that most named individual variables (numbers, pointers) end up sitting in registers. Operating with registers is fast and plays into the strengths of the hardware setup. Even when data dependencies, a major enemy of instruction-level parallelism, come into play, CPUs have special hardware dedicated to managing various dependency patterns. Operating with registers (i.e. named variables) is betting on the house. Do it.
In contrast, array operations (and general indirect accesses) are less natural across the entire compiler-processor-cache hierarchy. Save for a few obvious patterns, array accesses are not registered. Also, whenever pointers are involved, the compiler must assume the pointers could point to global data, meaning any function call may change pointed-to data arbitrarily. And of array operations, array writes are the worst of the pack. Given that all traffic with memory is done at cache-line granularity, writing one word to memory is essentially a cache line read followed by a cache line write. So given that to a good extent array reads are inevitable anyway, this piece of advice boils down to "avoid array writes wherever possible."
He also seems to suggest that this is a general guideline rather than a claim that array indexing is always faster (from the same post):
A few good, but less known, things to do for fast code:
Prefer static linking and position-dependent code (as opposed to PIC, position-independent code).
Prefer 64-bit code and 32-bit data.
Prefer array indexing to pointers (this one seems to reverse every ten years).
Prefer regular memory access patterns. Minimize control flow.
Avoid data dependencies.
I wrote an email to Andrei Alexandrescu and he was kind enough to reply. Here is his explanation:
"In order for the speedup to be visible, you need to exploit the ALU's capability of running either two 32-bit operations or one 64-bit operation in one cycle. Not every benchmark will show a speedup."
I understand it as "SIMD instructions process more data per cycle with 32-bit data than with 64-bit data". I have yet to find a benchmark (one that doesn't contain an array of integers) where this makes a difference. But I assume it's going to be difficult. Andrei used to work for Facebook, where every single percent was worth getting.
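For the case that is set aside here, an array of integers, the effect is easy to see, since a vector register holds twice as many 32-bit lanes as 64-bit ones. Here is a minimal sketch (my code, not Andrei's; it assumes a compiler that vectorizes these loops at -O2 or -O3):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

// Time `reps` passes of "x += 1" over every element of v.
template <typename T>
long long bump(std::vector<T>& v, int reps) {
    auto start = std::chrono::high_resolution_clock::now();
    for (int r = 0; r < reps; ++r)
        for (auto& x : v) x += 1;
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
        .count();
}

int main() {
    const std::size_t n = 1 << 14;  // small enough to stay in cache
    std::vector<std::int32_t> a(n, 0);
    std::vector<std::int64_t> b(n, 0);
    std::cout << "32-bit: " << bump(a, 10000) << " us\n";
    std::cout << "64-bit: " << bump(b, 10000) << " us\n";
    std::cout << a[0] + b[0] << '\n';  // defeat dead-code elimination
}

If the loops vectorize, one would expect the 32-bit version to run roughly twice as fast, simply because each vector operation touches twice as many elements.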