Why is this iterative Collatz method 30% slower than its recursive version in Python?

Prelude

I have two implementations for a particular problem, one recursive and one iterative, and I want to know what causes the iterative solution to be ~30% slower than the recursive one.

Given the recursive solution, I write an iterative solution making the stack explicit.
Clearly, I simply mimic what the recursion is doing, so of course the Python engine is better optimized to handle the bookkeeping. But can we write an iterative method with similar performance?

My case study is Problem #14 on Project Euler.

Find the longest Collatz chain with a starting number below one million.

Code

Here is a parsimonious recursive solution (credit due to veritas in the problem thread plus an optimization from jJjjJ):

def solve_PE14_recursive(ub=10**6):
    def collatz_r(n):
        if not n in table:
            if n % 2 == 0:
                table[n] = collatz_r(n // 2) + 1
            elif n % 4 == 3:
                table[n] = collatz_r((3 * n + 1) // 2) + 2
            else:
                table[n] = collatz_r((3 * n + 1) // 4) + 3
        return table[n]
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_r)

Here's my iterative version:

def solve_PE14_iterative(ub=10**6):
    def collatz_i(n):
        stack = []
        while not n in table:
            if n % 2 == 0:
                x, y = n // 2, 1
            elif n % 4 == 3:
                x, y = (3 * n + 1) // 2, 2
            else:
                x, y = (3 * n + 1) // 4, 3
            stack.append((n, y))
            n = x
        ysum = table[n]
        for x, y in reversed(stack):
            ysum += y
            table[x] = ysum
        return ysum
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_i)

And the timings on my machine (i7 machine with lots of memory) using IPython:

In [3]: %timeit solve_PE14_recursive()
1 loops, best of 3: 942 ms per loop
In [4]: %timeit solve_PE14_iterative()
1 loops, best of 3: 1.35 s per loop

Comments

The recursive solution is awesome:

Optimized to skip a step or two depending on the two least significant bits.
My original solution didn't skip any Collatz steps and took ~1.86 s
It is difficult to hit Python's default recursion limit of 1000.
collatz_r(9780657630) returns 1133 but requires less than 1000 recursive calls.
Memoization avoids retracing
collatz_r length calculated on-demand for max

Playing around with it, timings seem to be precise to +/- 5 ms.
Languages with static typing like C and Haskell can get timings below 100 ms.
I put the initialization of the memoization table in the method by design for this question, so that timings would reflect the "re-discovery" of the table values on each invocation.

collatz_r(2**1002) raises RuntimeError: maximum recursion depth exceeded.
collatz_i(2**1002) happily returns with 1003.

I am familiar with generators, coroutines, and decorators.
I am using Python 2.7. I am also happy to use Numpy (1.8 on my machine).

What I am looking for

an iterative solution that closes the performance gap
discussion on how Python handles recursion
the finer details of the performance penalties associated with an explicit stack

I'm looking mostly for the first, though the second and third are very important to this problem and would increase my understanding of Python.

876

asked Feb 11 '14 07:02

Prashant Kumar

2 Answers

Here's my shot at a (partial) explanation after running some benchmarks, which confirm your figures.

While recursive function calls are expensive in CPython, they aren't nearly as expensive as emulating a call stack using lists. The stack for a recursive call is a compact structure implemented in C (see Eli Bendersky's explanation and the file Python/ceval.c in the source code).

By contrast, your emulated stack is a Python list object, i.e. a heap-allocated, dynamically growing array of pointers to tuple objects, which in turn point to the actual values; goodbye, locality of reference, hello cache misses. You then use Python's notoriously slow iteration on these objects. A line-by-line profiling with kernprof confirms that iteration and list handling are taking a lot of time:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    16                                               @profile
    17                                               def collatz_i(n):
    18    750000       339195      0.5      2.4          stack = []
    19   3702825      1996913      0.5     14.2          while not n in table:
    20   2952825      1329819      0.5      9.5              if n % 2 == 0:
    21    864633       416307      0.5      3.0                  x, y = n // 2, 1
    22   2088192       906202      0.4      6.4              elif n % 4 == 3:
    23   1043583       617536      0.6      4.4                  x, y = (3 * n + 1) // 2, 2
    24                                                       else:
    25   1044609       601008      0.6      4.3                  x, y = (3 * n + 1) // 4, 3
    26   2952825      1543300      0.5     11.0              stack.append((n, y))
    27   2952825      1150867      0.4      8.2              n = x
    28    750000       352395      0.5      2.5          ysum = table[n]
    29   3702825      1693252      0.5     12.0          for x, y in reversed(stack):
    30   2952825      1254553      0.4      8.9              ysum += y
    31   2952825      1560177      0.5     11.1              table[x] = ysum
    32    750000       305911      0.4      2.2          return ysum

Interestingly, even n = x takes around 8% of the total running time.

(Unfortunately, I couldn't get kernprof to produce something similar for the recursive version.)

answered Sep 28 '22 13:09

Fred Foo

Iterative code is sometimes faster than recursive because it avoids function call overhead. However, stack.append is also a function call (and an attribute lookup on top) and adds similar overhead. Counting the append calls, the iterative version makes just as many function calls as the recursive version.

Comparing the first two and the last two timings here...

$ python -m timeit pass
10000000 loops, best of 3: 0.0242 usec per loop
$ python -m timeit -s "def f(n): pass" "f(1)"
10000000 loops, best of 3: 0.188 usec per loop
$ python -m timeit -s "def f(n): x=[]" "f(1)"
1000000 loops, best of 3: 0.234 usec per loop
$ python -m timeit -s "def f(n): x=[]; x.append" "f(1)"
1000000 loops, best of 3: 0.336 usec per loop
$ python -m timeit -s "def f(n): x=[]; x.append(1)" "f(1)"
1000000 loops, best of 3: 0.499 usec per loop

...confirms that the append call excluding attribute lookup takes approximately the same time as calling a minimal pure Python function, ~170 ns.

From the above I conclude that the iterative version does not enjoy an inherent advantage. The next question to consider is which one does more work. To get a (very) rough estimate, we can look at the number of lines executed in each version. I did a quick experiment to find out that:

collatz_r is called 1234275 times, and the body of the if block executes 984275 times.
collatz_i is called 250000 times, and the while loop runs 984275 times

Now, let's say collatz_r has 2 lines outside the if and 4 lines inside (that are executed in the worst case, when the else is hit). That adds up to 6.4 million lines to execute. Comparable figures for collatz_i could be 5 and 9, which add up to 10.0 million.

Even if that was just a rough estimate, it is well enough in line with the actual timings.

answered Sep 30 '22 13:09

Janne Karila

Related questions
                            
                                Can I run multiprocessing Python programs on a single core machine?
                            
                                When does Pandas default to broadcasting Series and Dataframes?
                            
                                python conditional debug breakpoint one-liner for versions before 3.7 PEP 553 that acts similarly to Perl's $DB::single=1
                            
                                matplotlib scatter fails with error: 'c' argument has n elements, which is not acceptable for use with 'x' with size n, 'y' with size n
                            
                                How to export a graph with alternative text in Excel to PDF using Python or R?
                            
                                Jupyter Lab freezes the computer when out of RAM - how to prevent it?
                            
                                Installation of earlier versions of Python (prior to 3.8) fails on Mac with M1 Chip (ARM based Apple Silicon)
                            
                                Extended slice that goes to beginning of sequence with negative stride
                            
                                Django QuerySet Custom Ordering by ID
                            
                                Is it possible to require PyQt from setuptools setup.py?
                            
                                how to enforce sqlite select for update transaction behavior in sqlalchemy
                            
                                When and why might I assign an instance of a descriptor class to a class attribute in Python rather than use a property?
                            
                                Inline-like solution for Django Admin where Admin contains ForeignKey to other model
                            
                                Hello World in mod_wsgi
                            
                                Python: How to abruptly disconnect a socket without closing it appropriately
                            
                                Why does setattr and getattr allow whitespace?
                            
                                How to sniff HTTP packets in python?
                            
                                Converting irregularly time stamped measurements into equally spaced, time-weighted averages
                            
                                pep8 compliant deep dictionary access
                            
                                Joining onto a subquery in SQLAlchemy

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why is this iterative Collatz method 30% slower than its recursive version in Python?

Tags:

performance

python

iteration

recursion