In one of his keynotes, Andrei Alexandrescu suggests that, on a 64-bit platform, using 32-bit array indexing is faster than using a raw pointer:
Page 16: http://www.slideshare.net/andreialexandrescu1/three-optimization-tips-for-c-15708507
On his Facebook account, he is more precise and says: "Prefer array indexing to pointers (this one seems to reverse every ten years)."
I have tried many things to find a difference, but I have not managed to build any program that shows it. Knowing Andrei, I would not be surprised if the difference were no more than a few percent, but I would be glad if anyone could find such an example.
Here is a test I have done. I chose n = 5000, large enough to get a decent timing and small enough so that everything fits in the L1 cache. I loop a few times so that the CPU frequency ramps up.
#include <iostream>
#include <chrono>
#include <cstddef>  // std::size_t

int main(int argc, const char* argv[]) {
    const int n{5000};
    int* p{new int[n]};

    // Warm up the cache
    for (int i{0}; i < n; i++) {
        p[i] += 1;
    }

    for (int j{0}; j < 5; j++) {
        {
            auto start_pointer = std::chrono::high_resolution_clock::now();
            for (int* q{p}; q != p + n; ++q) {
                ++(*q);
            }
            auto end_pointer = std::chrono::high_resolution_clock::now();
            auto time_pointer = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                    end_pointer - start_pointer)
                                    .count();
            std::cout << " Pointer: " << time_pointer << std::endl;
        }
        {
            auto start_pointer = std::chrono::high_resolution_clock::now();
            for (int* q{p}; q != p + n; ++q) {
                ++(*q);
            }
            auto end_pointer = std::chrono::high_resolution_clock::now();
            auto time_pointer = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                    end_pointer - start_pointer)
                                    .count();
            std::cout << " Pointer: " << time_pointer << std::endl;
        }
        {
            auto start_index_32 = std::chrono::high_resolution_clock::now();
            for (int i{0}; i < n; i++) {
                p[i] += 1;
            }
            auto end_index_32 = std::chrono::high_resolution_clock::now();
            auto time_index_32 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                     end_index_32 - start_index_32)
                                     .count();
            std::cout << "Index 32: " << time_index_32 << std::endl;
        }
        {
            auto start_index_32 = std::chrono::high_resolution_clock::now();
            for (int i{0}; i < n; i++) {
                p[i] += 1;
            }
            auto end_index_32 = std::chrono::high_resolution_clock::now();
            auto time_index_32 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                     end_index_32 - start_index_32)
                                     .count();
            std::cout << "Index 32: " << time_index_32 << std::endl;
        }
        {
            const std::size_t n_64{n};
            auto start_index_64 = std::chrono::high_resolution_clock::now();
            for (std::size_t i{0}; i < n_64; i++) {
                p[i] += 1;
            }
            auto end_index_64 = std::chrono::high_resolution_clock::now();
            auto time_index_64 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                     end_index_64 - start_index_64)
                                     .count();
            std::cout << "Index 64: " << time_index_64 << std::endl;
        }
        {
            const std::size_t n_64{n};
            auto start_index_64 = std::chrono::high_resolution_clock::now();
            for (std::size_t i{0}; i < n_64; i++) {
                p[i] += 1;
            }
            auto end_index_64 = std::chrono::high_resolution_clock::now();
            auto time_index_64 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                     end_index_64 - start_index_64)
                                     .count();
            std::cout << "Index 64: " << time_index_64 << std::endl;
        }
        std::cout << std::endl;
    }

    delete[] p;
    return 0;
}
Here is the kind of result I get on my machine (Core i7). All I get is "noise".
Pointer: 883
Pointer: 485
Index 32: 436
Index 32: 380
Index 64: 372
Index 64: 429
Pointer: 330
Pointer: 316
Index 32: 336
Index 32: 321
Index 64: 337
Index 64: 318
Pointer: 311
Pointer: 314
Index 32: 318
Index 32: 319
Index 64: 316
Index 64: 301
Pointer: 306
Pointer: 325
Index 32: 323
Index 32: 313
Index 64: 318
Index 64: 305
Pointer: 311
Pointer: 319
Index 32: 313
Index 32: 324
Index 64: 315
Index 64: 303
Pointer arithmetic is slightly faster (about 10%) than array indexing. The size of a pointer is 8 bytes. Note: this code was executed on a 64-bit processor; in 64-bit data models, pointer sizes are always 64 bits.
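A quick check of the pointer width (a trivial sketch; the exact sizes assume an LP64 platform such as 64-bit Linux or macOS):

#include <iostream>

int main() {
    // On an LP64 platform, every object pointer is 8 bytes wide,
    // regardless of the pointee type.
    std::cout << sizeof(char*) << ' ' << sizeof(int*) << '\n';  // prints "8 8"
    return 0;
}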
The problem with low-level advice like this (even coming from Andrei Alexandrescu) is that it ignores the fact that compilers optimize.
Modern compilers optimize so aggressively (and, in general, successfully) that it really becomes a mug's game to try to second-guess them. On the whole, writing clear, readable code will help you, your colleagues and your compilers analyze the code. And I honestly believe that is the best general advice that can be given.
One of the well-known optimizations which modern compilers use is the conversion between index- and pointer-based loops. In the particular case of your benchmark, with most optimization settings, gcc will compile both the pointer-based and the 32-bit-index-based loop to the same assembler output.
In the following, I replaced the chrono stuff with ++sentry, where sentry is a volatile, in order to reduce the code size. The assembly output corresponds to:
for (int* q{p}; q != p + n; ++q) ++(*q);
++sentry;
for (int i{0}; i < n; i++) p[i] += 1;
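For reference, a minimal compilable harness around those three lines might look like this (my reconstruction: p, n and sentry come from the answer; the enclosing function and its name are assumed):

volatile int sentry;  // the volatile write separates the two loops in the output

void increment_both_ways(int* p, int n) {
    for (int* q{p}; q != p + n; ++q) ++(*q);
    ++sentry;
    for (int i{0}; i < n; i++) p[i] += 1;
}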
Compiled with -O2, this produced the following (%rdi and %ebp were still initialized from the loop which populated p):
        movq    %rdi, %rdx
        cmpq    %rcx, %rdi
        je      .L10
.L16:
        addl    $1, (%rdx)
        addq    $4, %rdx
        cmpq    %rcx, %rdx
        jne     .L16
.L10:
        movl    sentry(%rip), %eax
        movq    %rdi, %rdx
        addl    $1, %eax
        movl    %eax, sentry(%rip)
        testl   %ebp, %ebp
        jle     .L8
.L14:
        addl    $1, (%rdx)
        addq    $4, %rdx
        cmpq    %rdx, %rsi
        jne     .L14
.L8:
You can see that there is no difference at all between the loops at .L16 and .L14.
Different optimization settings produce different results, of course. With -O3 the loops are vectorized using SIMD instructions and Duff's device, but again they are almost identical. clang does this optimization at -O2.
None of that denies the point being made, which is that the compiler may need to work harder to prove that a pointer which is being written through cannot modify arbitrary memory.
But in this case, as in many cases, the loop index is a local variable and the loop is simple enough that the compiler can fully analyze it, thus allowing strength reduction, unrolling, and vectorization; whether the control variable is a pointer or an index is then irrelevant.
A more interesting example (possibly) is a loop over two arrays where the base elements are different sizes. Given the following two functions:
void d2f_ptr(float* out, const double* in, int n) {
    for (auto lim = out + n; out < lim;) *out++ = *in++;
}

void d2f_idx(float out[], const double in[], int n) {
    for (int i = 0; i < n; ++i) out[i] = in[i];
}
gcc (v5.3.0, -O2) does produce different loops and the index-based loop is one instruction shorter:
d2f_ptr(float*, double const*, int):
        movslq  %edx, %rdx
        leaq    (%rdi,%rdx,4), %rax
        cmpq    %rax, %rdi
        jnb     .L11
.L15:
        addq    $4, %rdi
        addq    $8, %rsi
        pxor    %xmm0, %xmm0
        cvtsd2ss -8(%rsi), %xmm0
        movss   %xmm0, -4(%rdi)
        cmpq    %rdi, %rax
        ja      .L15
.L11:
        ret

d2f_idx(float*, double const*, int):
        xorl    %eax, %eax
        testl   %edx, %edx
        jle     .L16
.L20:
        pxor    %xmm0, %xmm0
        cvtsd2ss (%rsi,%rax,8), %xmm0
        movss   %xmm0, (%rdi,%rax,4)
        addq    $1, %rax
        cmpl    %eax, %edx
        jg      .L20
.L16:
        ret
But change the double and float to objects whose sizes no longer permit the use of the Intel chip's indexed addressing mode, and the compiler once again converts the index-based code to a pointer-based variant.
Here the code is essentially the same as before, but the double has been padded to 48 bytes:
struct Big {
    double val;
    char padding[40];
};

struct Small {
    float val;
    Small& operator=(const Big& other) {
        val = other.val;
        return *this;
    }
};
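The loops themselves are presumably unchanged except for the element types, i.e. something like the following (my reconstruction from the signatures in the assembly below):

void d2f_ptr(Small* out, const Big* in, int n) {
    for (auto lim = out + n; out < lim;) *out++ = *in++;
}

void d2f_idx(Small out[], const Big in[], int n) {
    for (int i = 0; i < n; ++i) out[i] = in[i];
}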
d2f_ptr(Small*, Big const*, int):
        movslq  %edx, %rdx
        leaq    (%rdi,%rdx,4), %rax
        cmpq    %rax, %rdi
        jnb     .L21
.L25:
        addq    $48, %rsi
        addq    $4, %rdi
        pxor    %xmm0, %xmm0
        cvtsd2ss -48(%rsi), %xmm0
        movss   %xmm0, -4(%rdi)
        cmpq    %rdi, %rax
        ja      .L25
.L21:
        ret

d2f_idx(Small*, Big const*, int):
        testl   %edx, %edx
        jle     .L26
        leal    -1(%rdx), %eax
        leaq    4(%rdi,%rax,4), %rax
.L29:
        pxor    %xmm0, %xmm0
        addq    $4, %rdi
        cvtsd2ss (%rsi), %xmm0
        addq    $48, %rsi
        movss   %xmm0, -4(%rdi)
        cmpq    %rax, %rdi
        jne     .L29
.L26:
        ret
It's possibly worth adding that it is not necessarily more difficult for compilers to analyze which object a particular pointer write will modify. [Edited: There was a quote from Alexandrescu here, but it wasn't as relevant as I thought, so I removed it, leaving this section to be mostly a strawman.]
In fact, if a pointer is only directly assigned to once, and all other modifications are through increment and decrement operations (including += and -=), then the compiler is totally within its rights to assume that the pointer always points within the same object. If some additive modification of the pointer were to overshoot into some other object, that would be Undefined Behaviour and the compiler can discard that possibility. It's easy enough to track assign and inc/dec operations in a flow graph, so in cases where the pointer could have been replaced with an index expression, it is quite possible for a compiler to figure that out and thus know that other objects are not being randomly mutated by writes through the pointer.
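As a small illustration of the aliasing problem (my example, not from the original text): in the function below the compiler must assume that p may point at the global g, so it has to reload g from memory after the store. Once it can prove that p only ever walks a known local array, that pessimism disappears.

int g;

int f(int* p) {
    g = 1;
    *p = 2;    // p might point at g...
    return g;  // ...so g must be reloaded from memory here
}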
His (Andrei Alexandrescu's) reasoning seems to be based on the fact that using a register for a pointer variable is generally harder for the compiler, since a pointer could be pointing to global data. But I don't see anything specific to 32-bit array indexing. (To my reading, the slide wasn't quite clear whether he was referring to 32-bit arrays or to array indexing on 32-bit systems.)
From the horse's mouth: (yes, it's a link to his Facebook account :)
Minimize array writes
To be faster, code should reduce the number of array writes, and more generally, writes through pointers.
On modern machines with large register files and ample register renaming hardware, you can assume that most named individual variables (numbers, pointers) end up sitting in registers. Operating with registers is fast and plays into the strengths of the hardware setup. Even when data dependencies, a major enemy of instruction-level parallelism, come into play, CPUs have special hardware dedicated to managing various dependency patterns. Operating with registers (i.e. named variables) is betting on the house. Do it.
In contrast, array operations (and general indirect accesses) are less natural across the entire compiler-processor-cache hierarchy. Save for a few obvious patterns, array accesses are not registered. Also, whenever pointers are involved, the compiler must assume the pointers could point to global data, meaning any function call may change pointed-to data arbitrarily. And of array operations, array writes are the worst of the pack. Given that all traffic with memory is done at cache-line granularity, writing one word to memory is essentially a cache line read followed by a cache line write. So given that to a good extent array reads are inevitable anyway, this piece of advice boils down to "avoid array writes wherever possible."
He also seems to suggest that this is a general guideline rather than a claim that array indexing is always faster (from the same post):
A few good, but less known, things to do for fast code:
Prefer static linking and position-dependent code (as opposed to PIC, position-independent code).
Prefer 64-bit code and 32-bit data.
Prefer array indexing to pointers (this one seems to reverse every ten years).
Prefer regular memory access patterns. Minimize control flow.
Avoid data dependencies.
I wrote an email to Andrei Alexandrescu and he was kind enough to reply. Here is his explanation:
"In order for the speedup to be visible, you need to exploit the ALU's capability of running either two 32-bit operations or one 64-bit operation in one cycle. Not every benchmark will show a speedup."
I understand it as "SIMD instructions process more data per cycle with 32-bit data than with 64-bit data". I have yet to find a benchmark (one that doesn't contain an array of integers) where this makes a difference. But I assume it's going to be difficult. Andrei used to work for Facebook, where every single percent was worth getting.
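For the case that is set aside here, an array of integers, the effect is easy to see, since a vector register holds twice as many 32-bit lanes as 64-bit ones. Here is a minimal sketch (my code, not Andrei's; it assumes a compiler that vectorizes these loops at -O2 or -O3):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

// Time `reps` passes of "x += 1" over every element of v.
template <typename T>
long long bump(std::vector<T>& v, int reps) {
    auto start = std::chrono::high_resolution_clock::now();
    for (int r = 0; r < reps; ++r)
        for (auto& x : v) x += 1;
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
        .count();
}

int main() {
    const std::size_t n = 1 << 14;  // small enough to stay in cache
    std::vector<std::int32_t> a(n, 0);
    std::vector<std::int64_t> b(n, 0);
    std::cout << "32-bit: " << bump(a, 10000) << " us\n";
    std::cout << "64-bit: " << bump(b, 10000) << " us\n";
    std::cout << a[0] + b[0] << '\n';  // defeat dead-code elimination
}

If the loops vectorize, one would expect the 32-bit version to run roughly twice as fast, simply because each vector operation touches twice as many elements.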