I was optimising some Python code, and tried the following experiment:
import time start = time.clock() x = 0 for i in range(10000000): x += 1 end = time.clock() print '+=',end-start start = time.clock() x = 0 for i in range(10000000): x -= -1 end = time.clock() print '-=',end-start
The second loop is reliably faster, anywhere from a whisker to 10%, depending on the system I run it on. I've tried varying the order of the loops, number of executions etc, and it still seems to work.
Stranger,
for i in range(10000000, 0, -1):
(ie running the loop backwards) is faster than
for i in range(10000000):
even when loop contents are identical.
What gives, and is there a more general programming lesson here?
I can reproduce this on my Q6600 (Python 2.6.2); increasing the range to 100000000:
('+=', 11.370000000000001) ('-=', 10.769999999999998)
First, some observations:
INPLACE_ADD
vs. INPLACE_SUBTRACT
and +1 vs -1.Looking at the Python source, I can make a guess. This is handled in ceval.c, in PyEval_EvalFrameEx
. INPLACE_ADD
has a significant extra block of code, to handle string concatenation. That block doesn't exist in INPLACE_SUBTRACT
, since you can't subtract strings. That means INPLACE_ADD
contains more native code. Depending (heavily!) on how the code is being generated by the compiler, this extra code may be inline with the rest of the INPLACE_ADD code, which means additions can hit the instruction cache harder than subtraction. This could be causing extra L2 cache hits, which could cause a significant performance difference.
This is heavily dependent on the system you're on (different processors have different amounts of cache and cache architectures), the compiler in use, including the particular version and compilation options (different compilers will decide differently which bits of code are on the critical path, which determines how assembly code is lumped together), and so on.
Also, the difference is reversed in Python 3.0.1 (+: 15.66, -: 16.71); no doubt this critical function has changed a lot.
$ python -m timeit -s "x=0" "x+=1" 10000000 loops, best of 3: 0.151 usec per loop $ python -m timeit -s "x=0" "x-=-1" 10000000 loops, best of 3: 0.154 usec per loop
Looks like you've some measurement bias
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With