On a mailing list I'm subscribed to, two fairly knowledgeable (IMO) programmers were discussing some optimized code, and saying something along the lines of:
On the CPUs released 5-8 years ago, it was slightly faster to iterate for loops backwards (e.g.
for (int i=x-1; i>=0; i--) {...}
) because comparingi
to zero is more efficient than comparing it to some other number. But with very recent CPUs (e.g. from 2008-2009) the speculative loader logic is such that it works better if the for loop is iterated forward (e.g.for (int i=0; i< x; i++) {...}
).
My question is, is that true? Have CPU implementations changed recently such that forward-loop-iterating now has an advantage over backward-iterating? If so, what is the explanation for that? i.e. what changed?
(Yes, I know, premature optimization is the root of all evil, review my algorithm before worrying about micro-optimizations, etc etc... mostly I'm just curious)
Conclusions. List comprehensions are often not only more readable but also faster than using “for loops.” They can simplify your code, but if you put too much logic inside, they will instead become harder to read and understand.
For loop (forward and reverse) The traditional for loop is the fastest, so you should always use that right? Not so fast - performance is not the only thing that matters.
forEach loop The forEach method in Javascript iterates over the elements of an array and calls the provided function for each element in order. The execution time of forEach is dramatically affected by what happens inside each iteration. It is fast and designed for functional code.
The fastest loop is a for loop, both with and without caching length delivering really similar performance.
You're really asking about prefetching, not about loop control logic.
In general, loop performance isn't going to be dictated by the control logic (i.e. the increment/decrement and the condition that gets checked every time through). The time it takes to do these things is inconsequential except in very tight loops. If you're interested in that, take a look at John Knoeller's answer for specifics on the 8086's counter register and why it might've been true in the old days that counting down was more efficient. As John says, branch prediction (and also speculation) can play a role in performance here, as can instruction prefetching.
Iteration order can affect performance significantly when it changes the order in which your loop touches memory. The order in which you request memory addresses can affect what is drawn into your cache and also what is evicted from your cache when there is no longer room to fetch new cache lines. Having to go to memory more often than needed is much more expensive than compares, increments, or decrements. On modern CPUs it can take thousands of cycles to get from the processor to memory, and your processor may have to idle for some or all of that time.
You're probably familiar with caches, so I won't go into all those details here. What you may not know is that modern processors employ a whole slew of prefetchers to try to predict what data you're going to need next at different levels of the memory hierarchy. Once they predict, they try to pull that data from memory or lower level caches so that you have what you need when you get around to processing it. Depending on how well they grab what you need next, your performance may or may not improve when using them.
Take a look at Intel's guide to optimizing for hardware prefetchers. There are four prefetchers listed; two for NetBurst chips:
and two for Core:
If you're iterating through an array forward, you're going to generate a bunch of sequential, usually contiguous memory references. The ACL prefetchers are going to do much better for forward loops (because you'll end up using those subsequent cache lines) than for backward loops, but you may do ok making memory references backward if the prefetchers can detect this (as with the hardware prefetchers). The hardware prefetchers on the Core can detect strides, which is helpful for for more sophisticated array traversals.
These simple heuristics can get you into trouble in some cases. For example, Intel actually recommends that you turn off adjacent cache line prefetching for servers, because they tend to make more random memory references than desktop user machines. The probability of not using an adjacent cache line is higher on a server, so fetching data you're not actually going to use ends up polluting your cache (filling it with unwanted data), and performance suffers. For more on addressing this kind of problem, take a look at this paper from Supercomputing 2009 on using machine learning to tune prefetchers in large data centers. Some guys at Google are on that paper; performance is something that is of great concern to them.
Simple heuristics aren't going to help you with more sophisticated algorithms, and you might have to start thinking about the sizes of your L1, L2, etc. caches. Image processing, for example, often requires that you perform some operation on subsections of a 2D image, but the order you traverse the image can affect how well useful pieces of it stay in your cache without being evicted. Take a look at Z-order traversals and loop tiling if you're interested in this sort of thing. It's a pretty basic example of mapping the 2D locality of image data to the 1D locality of memory to improve performance. It's also an area where compilers aren't always able to restructure your code in the best way, but manually restructuring your C code can improve cache performance drastically.
I hope this gives you an idea of how iteration order affects memory performance. It does depend on the particular architecture, but the ideas are general. You should be able to understand prefetching on AMD and Power if you can understand it on Intel, and you don't really have to know assembly to structure your code to take advantage of memory. You just need to know a little computer architecture.
I don't know. But I do know how to write a quick benchmark with no guarantees of scientific validity (actually, one with rather strict guarantees of invalidity). It has interesting results:
#include <time.h>
#include <stdio.h>
int main(void)
{
int i;
int s;
clock_t start_time, end_time;
int centiseconds;
start_time = clock();
s = 1;
for (i = 0; i < 1000000000; i++)
{
s = s + i;
}
end_time = clock();
centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
printf("Answer is %d; Forward took %ld centiseconds\n", s, centiseconds);
start_time = clock();
s = 1;
for (i = 999999999; i >= 0; i--)
{
s = s + i;
}
end_time = clock();
centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
printf("Answer is %d; Backward took %ld centiseconds\n", s, centiseconds);
return 0;
}
Compiled with -O9 using gcc 3.4.4 on Cygwin, running on an "AMD Athlon(tm) 64 Processor 3500+" (2211 MHz) in 32 bit Windows XP:
Answer is -1243309311; Forward took 93 centiseconds
Answer is -1243309311; Backward took 92 centiseconds
(Answers varied by 1 either way in several repetitions.)
Compiled with -I9 using gcc 4.4.1 running on an "Intel(R) Atom(TM) CPU N270 @ 1.60GHz" (800 MHz and presumably only one core, given the program) in 32 bit Ubuntu Linux.
Answer is -1243309311; Forward took 196 centiseconds
Answer is -1243309311; Backward took 228 centiseconds
(Answers varied by 1 either way in several repetitions.)
Looking at the code, the forward loop is translated to:
; Gcc 3.4.4 on Cygwin for Athlon ; Gcc 4.4.1 on Ubuntu for Atom
L5: .L2:
addl %eax, %ebx addl %eax, %ebx
incl %eax addl $1, %eax
cmpl $999999999, %eax cmpl $1000000000, %eax
jle L5 jne .L2
The backward to:
L9: .L3:
addl %eax, %ebx addl %eax, %ebx
decl %eax subl $1, $eax
jns L9 cmpl $-1, %eax
jne .L3
Which shows, if not much else, that GCC's behaviour has changed between those two versions!
Pasting the older GCC's loops into the newer GCC's asm file gives results of:
Answer is -1243309311; Forward took 194 centiseconds
Answer is -1243309311; Backward took 133 centiseconds
Summary: on the >5 year old Athlon, the loops generated by GCC 3.4.4 are the same speed. On the newish (<1 year?) Atom, the backward loop is significantly faster. GCC 4.4.1 has a slight regression for this particular case which I personally am not bothered about in the least, given the point of it. (I had to make sure that s
is used after the loop, because otherwise the compiler would elide the computation altogether.)
[1] I can never remember the command for system info...
Yes. but with a caveat. The idea that looping backwards is faster never applied to all older CPUs. It's an x86 thing (as in 8086 through 486, possibly Pentium, although I don't think any further).
That optimization never applied to any other CPU architecture that I know of.
Here's why.
The 8086 had a register that was specifically optimized for use as a loop counter. You put your loop count in CX, and then there are several instructions that decrement CX and then set condition codes if it goes to zero. In fact there was an instruction prefix you could put before other instructions (the REP prefix) that would basically iterate the other instruction until CX got to 0.
Back in the days when we counted instructions and instructions had known fixed cycle counts using cx as your loop counter was the way to go, and cx was optimized for counting down.
But that was a long time ago. Ever since the Pentium, those complex instructions have been slower overall than using more, and simpler instructions. (RISC baby!) The key thing we try to do these days is try to put some time between loading a register and using it because the pipelines can actually do multiple things per cycle as long as you don't try to use the same register for more than one thing at a time.
Nowdays the thing that kills performance isn't the comparison, it's the branching, and then only when the branch prediction predicts wrong.
I stumbled upon this question after observing a significant drop in performance when iterating over an array backwards vs forwards. I was afraid it would be the prefetcher, but the previous answers convinced me this was not the case. I then investigated further and found out that it looks like GCC (4.8.4) is unable to exploit the full power of SIMD operations in a backward loop.
In fact, compiling the following code (from here) with -S -O3 -mavx
:
for (i = 0; i < N; ++i)
r[i] = (a[i] + b[i]) * c[i];
leads to essentially:
.L10:
addl $1, %edx
vmovupd (%rdi,%rax), %xmm1
vinsertf128 $0x1, 16(%rdi,%rax), %ymm1, %ymm1
vmovupd (%rsi,%rax), %xmm0
vinsertf128 $0x1, 16(%rsi,%rax), %ymm0, %ymm0
vaddpd (%r9,%rax), %ymm1, %ymm1
vmulpd %ymm0, %ymm1, %ymm0
vmovupd %xmm0, (%rcx,%rax)
vextractf128 $0x1, %ymm0, 16(%rcx,%rax)
addq $32, %rax
cmpl %r8d, %edx
jb .L10
i.e. assembly code that uses the AVX extensions to perform four double operations in parallel (for example, vaddpd and vmulpd).
Conversely, the following code compiled with the same parameters:
for (i = 0; i < N; ++i)
r[N-1-i] = (a[N-1-i] + b[N-1-i]) * c[N-1-i];
produces:
.L5:
vmovsd a+79992(%rax), %xmm0
subq $8, %rax
vaddsd b+80000(%rax), %xmm0, %xmm0
vmulsd c+80000(%rax), %xmm0, %xmm0
vmovsd %xmm0, r+80000(%rax)
cmpq $-80000, %rax
jne .L5
which only performs one double operation at the time (vaddsd, vmulsd).
This fact alone may be responsible for a factor of 4 between the performance when iterating backward vs forward.
Using -ftree-vectorizer-verbose=2
, it looks like the problem is storing backwards: "negative step for store". In fact, if a
, b
, and c
are read backwards, but r
is written into in the forward direction, and the code is vectorized again.
It probably doesn't make a hoot of difference speed-wise, but I often write:
for (i = n; --i >= 0; ) blah blah
which I think at one time generated cleaner assembly.
Of course, in answering this kind of question, I run the risk of affirming that this is important. It's a micro-optimization kind of question, which is closely related to premature optimization, which everybody says you shouldn't do, but nevertheless SO is awash in it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With