I have two implementations for a particular problem, one recursive and one iterative, and I want to know what causes the iterative solution to be ~30% slower than the recursive one.
Given the recursive solution, I write an iterative solution making the stack explicit.
Clearly, I simply mimic what the recursion is doing, so of course the Python engine is better optimized to handle the bookkeeping. But can we write an iterative method with similar performance?
My case study is Problem #14 on Project Euler.
Find the longest Collatz chain with a starting number below one million.
Here is a parsimonious recursive solution (credit due to veritas in the problem thread plus an optimization from jJjjJ):
def solve_PE14_recursive(ub=10**6):
    def collatz_r(n):
        if not n in table:
            if n % 2 == 0:
                table[n] = collatz_r(n // 2) + 1
            elif n % 4 == 3:
                table[n] = collatz_r((3 * n + 1) // 2) + 2
            else:
                table[n] = collatz_r((3 * n + 1) // 4) + 3
        return table[n]
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_r)
Here's my iterative version:
def solve_PE14_iterative(ub=10**6):
    def collatz_i(n):
        stack = []
        while not n in table:
            if n % 2 == 0:
                x, y = n // 2, 1
            elif n % 4 == 3:
                x, y = (3 * n + 1) // 2, 2
            else:
                x, y = (3 * n + 1) // 4, 3
            stack.append((n, y))
            n = x
        ysum = table[n]
        for x, y in reversed(stack):
            ysum += y
            table[x] = ysum
        return ysum
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_i)
And the timings on my machine (i7 machine with lots of memory) using IPython:
In [3]: %timeit solve_PE14_recursive()
1 loops, best of 3: 942 ms per loop
In [4]: %timeit solve_PE14_iterative()
1 loops, best of 3: 1.35 s per loop
The recursive solution is awesome:
collatz_r(9780657630) returns 1133 but requires less than 1000 recursive calls.collatz_r length calculated on-demand for max
Playing around with it, timings seem to be precise to +/- 5 ms.
Languages with static typing like C and Haskell can get timings below 100 ms.
I put the initialization of the memoization table in the method by design for this question, so that timings would reflect the "re-discovery" of the table values on each invocation.
collatz_r(2**1002) raises RuntimeError: maximum recursion depth exceeded.collatz_i(2**1002) happily returns with 1003.
I am familiar with generators, coroutines, and decorators.
I am using Python 2.7. I am also happy to use Numpy (1.8 on my machine).
I'm looking mostly for the first, though the second and third are very important to this problem and would increase my understanding of Python.
In Recursion format, a stack is used to store the set of new local variables and parameters each time the function is called. Since recursion makes use of the stack data structure and due to this overhead, it is slower than the iteration code format.
Recursion has a large amount of overhead as compared to Iteration. It is usually much slower because all function calls must be stored in a stack to allow the return back to the caller functions.
Recursion terminates when a base case is recognized. Recursion is usually slower than iteration due to the overhead of maintaining the stack. Recursion uses more memory than iteration. Recursion makes the code smaller.
An Iterative algorithm will be faster than the Recursive algorithm because of overheads like calling functions and registering stacks repeatedly. Many times the recursive algorithms are not efficient as they take more space and time.
Here's my shot at a (partial) explanation after running some benchmarks, which confirm your figures.
While recursive function calls are expensive in CPython, they aren't nearly as expensive as emulating a call stack using lists. The stack for a recursive call is a compact structure implemented in C (see Eli Bendersky's explanation and the file Python/ceval.c in the source code).
By contrast, your emulated stack is a Python list object, i.e. a heap-allocated, dynamically growing array of pointers to tuple objects, which in turn point to the actual values; goodbye, locality of reference, hello cache misses. You then use Python's notoriously slow iteration on these objects. A line-by-line profiling with kernprof confirms that iteration and list handling are taking a lot of time:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    16                                               @profile
    17                                               def collatz_i(n):
    18    750000       339195      0.5      2.4          stack = []
    19   3702825      1996913      0.5     14.2          while not n in table:
    20   2952825      1329819      0.5      9.5              if n % 2 == 0:
    21    864633       416307      0.5      3.0                  x, y = n // 2, 1
    22   2088192       906202      0.4      6.4              elif n % 4 == 3:
    23   1043583       617536      0.6      4.4                  x, y = (3 * n + 1) // 2, 2
    24                                                       else:
    25   1044609       601008      0.6      4.3                  x, y = (3 * n + 1) // 4, 3
    26   2952825      1543300      0.5     11.0              stack.append((n, y))
    27   2952825      1150867      0.4      8.2              n = x
    28    750000       352395      0.5      2.5          ysum = table[n]
    29   3702825      1693252      0.5     12.0          for x, y in reversed(stack):
    30   2952825      1254553      0.4      8.9              ysum += y
    31   2952825      1560177      0.5     11.1              table[x] = ysum
    32    750000       305911      0.4      2.2          return ysum
Interestingly, even n = x takes around 8% of the total running time.
(Unfortunately, I couldn't get kernprof to produce something similar for the recursive version.)
Iterative code is sometimes faster than recursive because it avoids function call overhead. However, stack.append is also a function call (and an attribute lookup on top) and adds similar overhead. Counting the append calls, the iterative version makes just as many function calls as the recursive version.
Comparing the first two and the last two timings here...
$ python -m timeit pass
10000000 loops, best of 3: 0.0242 usec per loop
$ python -m timeit -s "def f(n): pass" "f(1)"
10000000 loops, best of 3: 0.188 usec per loop
$ python -m timeit -s "def f(n): x=[]" "f(1)"
1000000 loops, best of 3: 0.234 usec per loop
$ python -m timeit -s "def f(n): x=[]; x.append" "f(1)"
1000000 loops, best of 3: 0.336 usec per loop
$ python -m timeit -s "def f(n): x=[]; x.append(1)" "f(1)"
1000000 loops, best of 3: 0.499 usec per loop
...confirms that the append call excluding attribute lookup takes approximately the same time as calling a minimal pure Python function, ~170 ns.
From the above I conclude that the iterative version does not enjoy an inherent advantage. The next question to consider is which one does more work. To get a (very) rough estimate, we can look at the number of lines executed in each version. I did a quick experiment to find out that:
collatz_r is called 1234275 times, and the body of the if block executes 984275 times.collatz_i is called 250000 times, and the while loop runs 984275 timesNow, let's say collatz_r has 2 lines outside the if and 4 lines inside (that are executed in the worst case, when the else is hit). That adds up to 6.4 million lines to execute. Comparable figures for collatz_i could be 5 and 9, which add up to 10.0 million.
Even if that was just a rough estimate, it is well enough in line with the actual timings.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With