I have two implementations of a solution to a particular problem, one recursive and one iterative, and I want to know what causes the iterative version to be ~30% slower than the recursive one.
Given the recursive solution, I wrote an iterative solution that makes the stack explicit.
Clearly, I am simply mimicking what the recursion is doing, so presumably the Python engine is better optimized to handle the bookkeeping. But can we write an iterative method with similar performance?
My case study is Problem #14 on Project Euler: find the longest Collatz chain with a starting number below one million.
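(For reference, here is a plain, unmemoized chain-length function; this is my addition, just to pin down the counting convention used throughout: the chain length includes both the starting number and the final 1.)

def collatz_len(n):
    # Raw Collatz rule, no memoization: halve if even, else 3n + 1.
    length = 1
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        length += 1
    return length

For example, collatz_len(13) == 10, counting 13, 40, 20, 10, 5, 16, 8, 4, 2, 1.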
Here is a parsimonious recursive solution (credit due to veritas in the problem thread plus an optimization from jJjjJ):
def solve_PE14_recursive(ub=10**6):
    def collatz_r(n):
        # Memoized chain length; odd steps are compressed: for odd n,
        # 3*n + 1 is always even, and n % 4 determines whether it is
        # divisible by just 2 (take 2 steps) or by 4 (take 3 steps).
        if n not in table:
            if n % 2 == 0:
                table[n] = collatz_r(n // 2) + 1
            elif n % 4 == 3:
                table[n] = collatz_r((3 * n + 1) // 2) + 2
            else:
                table[n] = collatz_r((3 * n + 1) // 4) + 3
        return table[n]
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_r)
Here's my iterative version:
def solve_PE14_iterative(ub=10**6):
    def collatz_i(n):
        # Walk down the chain, recording each visited number and its
        # step increment on an explicit stack, until a known value.
        stack = []
        while n not in table:
            if n % 2 == 0:
                x, y = n // 2, 1
            elif n % 4 == 3:
                x, y = (3 * n + 1) // 2, 2
            else:
                x, y = (3 * n + 1) // 4, 3
            stack.append((n, y))
            n = x
        # Unwind the stack, filling the memoization table on the way up.
        ysum = table[n]
        for x, y in reversed(stack):
            ysum += y
            table[x] = ysum
        return ysum
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_i)
And the timings on my machine (an i7 with lots of memory), using IPython:
In [3]: %timeit solve_PE14_recursive()
1 loops, best of 3: 942 ms per loop
In [4]: %timeit solve_PE14_iterative()
1 loops, best of 3: 1.35 s per loop
Some notes:
- The recursive solution is awesome: collatz_r(9780657630) returns 1133 but requires fewer than 1000 recursive calls (the collatz_r lengths are calculated on demand for max).
- Playing around with it, timings seem to be precise to +/- 5 ms.
- Languages with static typing like C and Haskell can get timings below 100 ms.
- I put the initialization of the memoization table in the method by design, so that the timings reflect the "re-discovery" of the table values on each invocation (a sketch of the alternative, a table that persists across calls, follows these notes).
- collatz_r(2**1002) raises RuntimeError: maximum recursion depth exceeded, while collatz_i(2**1002) happily returns 1003.
- I am familiar with generators, coroutines, and decorators.
- I am using Python 2.7. I am also happy to use Numpy (1.8 on my machine).
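Here is the minimal sketch of that alternative (hypothetical; it is not what the timings above measure): keep the table at module level so that repeated calls reuse previously computed lengths.

# Hypothetical variant with a persistent memoization table; a second
# call to solve_PE14_recursive_cached() would return almost instantly,
# which is exactly what the timings above are designed not to measure.
_table = {1: 1}

def collatz_r_cached(n):
    if n not in _table:
        if n % 2 == 0:
            _table[n] = collatz_r_cached(n // 2) + 1
        elif n % 4 == 3:
            _table[n] = collatz_r_cached((3 * n + 1) // 2) + 2
        else:
            _table[n] = collatz_r_cached((3 * n + 1) // 4) + 3
    return _table[n]

def solve_PE14_recursive_cached(ub=10**6):
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_r_cached)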
I'm looking mostly for an answer to the first question (what makes the iterative version slower?), though the others (can we write an iterative method with similar performance, and how?) are very important to this problem and would increase my understanding of Python.
Here's my shot at a (partial) explanation after running some benchmarks, which confirm your figures.
While recursive function calls are expensive in CPython, they aren't nearly as expensive as emulating a call stack using lists. The stack for a recursive call is a compact structure implemented in C (see Eli Bendersky's explanation and the file Python/ceval.c in the source code).
By contrast, your emulated stack is a Python list object, i.e. a heap-allocated, dynamically growing array of pointers to tuple objects, which in turn point to the actual values; goodbye locality of reference, hello cache misses. You then use Python's notoriously slow iteration on these objects. A line-by-line profiling with kernprof confirms that iteration and list handling are taking a lot of time:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
16 @profile
17 def collatz_i(n):
18 750000 339195 0.5 2.4 stack = []
19 3702825 1996913 0.5 14.2 while not n in table:
20 2952825 1329819 0.5 9.5 if n % 2 == 0:
21 864633 416307 0.5 3.0 x, y = n // 2, 1
22 2088192 906202 0.4 6.4 elif n % 4 == 3:
23 1043583 617536 0.6 4.4 x, y = (3 * n + 1) // 2, 2
24 else:
25 1044609 601008 0.6 4.3 x, y = (3 * n + 1) // 4, 3
26 2952825 1543300 0.5 11.0 stack.append((n, y))
27 2952825 1150867 0.4 8.2 n = x
28 750000 352395 0.5 2.5 ysum = table[n]
29 3702825 1693252 0.5 12.0 for x, y in reversed(stack):
30 2952825 1254553 0.4 8.9 ysum += y
31 2952825 1560177 0.5 11.1 table[x] = ysum
32 750000 305911 0.4 2.2 return ysum
Interestingly, even n = x takes around 8% of the total running time.
(Unfortunately, I couldn't get kernprof to produce something similar for the recursive version.)
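As a side note on those heap-allocated stack entries: a quick sys.getsizeof check (my addition; the exact byte counts vary by build) shows what every push allocates, before even counting the integer objects the tuple points to:

import sys

# Each pushed frame of the emulated stack is a freshly allocated
# 2-tuple; on a 64-bit CPython 2.7 build this alone is a few dozen
# bytes per step, plus the (possibly boxed) ints it references.
print sys.getsizeof((982963, 2))
print sys.getsizeof(982963)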
Iterative code is sometimes faster than recursive code because it avoids function call overhead. However, stack.append is also a function call (and an attribute lookup on top) and adds similar overhead. Counting the append calls, the iterative version makes just as many function calls as the recursive version.
Comparing the first two and the last two timings here...
$ python -m timeit pass
10000000 loops, best of 3: 0.0242 usec per loop
$ python -m timeit -s "def f(n): pass" "f(1)"
10000000 loops, best of 3: 0.188 usec per loop
$ python -m timeit -s "def f(n): x=[]" "f(1)"
1000000 loops, best of 3: 0.234 usec per loop
$ python -m timeit -s "def f(n): x=[]; x.append" "f(1)"
1000000 loops, best of 3: 0.336 usec per loop
$ python -m timeit -s "def f(n): x=[]; x.append(1)" "f(1)"
1000000 loops, best of 3: 0.499 usec per loop
...confirms that the append call, excluding the attribute lookup, takes approximately the same time as calling a minimal pure Python function, ~170 ns.
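(A natural follow-up, which I suggest here without having recorded a timing, is to pre-bind the method in the setup, so the timed call pays for the append itself but not for the attribute lookup:)

$ python -m timeit -s "def f(n): x=[]; a=x.append; a(1)" "f(1)"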
From the above I conclude that the iterative version does not enjoy an inherent advantage. The next question to consider is which one does more work. To get a (very) rough estimate, we can look at the number of lines executed in each version. A quick experiment shows that:

- collatz_r is called 1234275 times, and the body of the if block executes 984275 times.
- collatz_i is called 250000 times, and the while loop runs 984275 times.

Now, let's say collatz_r has 2 lines outside the if and 4 lines inside (executed in the worst case, when the else is hit). That adds up to 6.4 million lines to execute (1234275 × 2 + 984275 × 4). The comparable figures for collatz_i could be 5 and 9, which add up to 10.1 million (250000 × 5 + 984275 × 9).
Even if that is just a rough estimate, it is well enough in line with the actual timings: 10.1/6.4 ≈ 1.6, versus the measured ratio of 1.35 s / 0.94 s ≈ 1.4.
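As for whether an iterative method can close the gap: the numbers above suggest two cheap tweaks, namely hoisting the stack.append attribute lookup out of the loop and replacing the reversed() accumulation pass with one subtraction per entry by pushing a running step offset. Here's an untimed sketch along those lines (a hypothesis to benchmark, not a measured result; the names are mine):

def solve_PE14_iterative_tuned(ub=10**6):
    def collatz_i2(n):
        stack = []
        push = stack.append        # hoist the attribute lookup out of the loop
        steps = 0                  # steps taken from the original n so far
        while n not in table:
            push((n, steps))
            if n % 2 == 0:
                n, steps = n // 2, steps + 1
            elif n % 4 == 3:
                n, steps = (3 * n + 1) // 2, steps + 2
            else:
                n, steps = (3 * n + 1) // 4, steps + 3
        total = steps + table[n]   # chain length of the original n
        for x, s in stack:         # one subtraction per entry, any order
            table[x] = total - s
        return total
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_i2)

If the ~0.1 µs attribute-lookup cost measured above also holds inside the loop, its roughly 3 million executions would account for about 0.3 s, much of the observed gap; but that is exactly the kind of claim %timeit should confirm.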