Unexpected performance curve from CPython merge sort

Question

I have implemented a naive merge sorting algorithm in Python. Algorithm and test code is below:

import time
import random
import matplotlib.pyplot as plt
import math
from collections import deque

def sort(unsorted):
    if len(unsorted) <= 1:
        return unsorted
    to_merge = deque(deque([elem]) for elem in unsorted)
    while len(to_merge) > 1:
        left = to_merge.popleft()
        right = to_merge.popleft()
        to_merge.append(merge(left, right))
    return to_merge.pop()

def merge(left, right):
    result = deque()
    while left or right:
        if left and right:
            elem = left.popleft() if left[0] > right[0] else right.popleft()
        elif not left and right:
            elem = right.popleft()
        elif not right and left:
            elem = left.popleft()
        result.append(elem)
    return result

LOOP_COUNT = 100
START_N = 1
END_N = 1000

def test(fun, test_data):
    start = time.clock()
    for _ in xrange(LOOP_COUNT):
        fun(test_data)
    return  time.clock() - start

def run_test():
    timings, elem_nums = [], []
    test_data = random.sample(xrange(100000), END_N)
    for i in xrange(START_N, END_N):
        loop_test_data = test_data[:i]
        elapsed = test(sort, loop_test_data)
        timings.append(elapsed)
        elem_nums.append(len(loop_test_data))
        print "%f s --- %d elems" % (elapsed, len(loop_test_data))
    plt.plot(elem_nums, timings)
    plt.show()

run_test()

As much as I can see everything is OK and I should get a nice N*logN curve as a result. But the picture differs a bit:

enter image description here

Things I've tried to investigate the issue:

PyPy. The curve is ok.
Disabled the GC using the gc module. Wrong guess. Debug output showed that it doesn't even run until the end of the test.
Memory profiling using meliae - nothing special or suspicious. ` I had another implementation (a recursive one using the same merge function), it acts the similar way. The more full test cycles I create - the more "jumps" there are in the curve.

So how can this behaviour be explained and - hopefully - fixed?

UPD: changed lists to collections.deque

UPD2: added the full test code

UPD3: I use Python 2.7.1 on a Ubuntu 11.04 OS, using a quad-core 2Hz notebook. I tried to turn of most of all other processes: the number of spikes went down but at least one of them was still there.

max · Accepted Answer

You are simply picking up the impact of other processes on your machine.

You run your sort function 100 times for input size 1 and record the total time spent on this. Then you run it 100 times for input size 2, and record the total time spent. You continue doing so until you reach input size 1000.

Let's say once in a while your OS (or you yourself) start doing something CPU-intensive. Let's say this "spike" lasts as long as it takes you to run your sort function 5000 times. This means that the execution times would look slow for 5000 / 100 = 50 consecutive input sizes. A while later, another spike happens, and another range of input sizes look slow. This is precisely what you see in your chart.

I can think of one way to avoid this problem. Run your sort function just once for each input size: 1, 2, 3, ..., 1000. Repeat this process 100 times, using the same 1000 inputs (it's important, see explanation at the end). Now take the minimum time spent for each input size as your final data point for the chart.

That way, your spikes should only affect each input size only a few times out of 100 runs; and since you're taking the minimum, they will likely have no impact on the final chart at all.

If your spikes are really really long and frequent, you of course might want to increase the number of repetitions beyond the current 100 per input size.

Looking at your spikes, I notice the execution slows down exactly 3 times during a spike. I'm guessing the OS gives your python process one slot out of three during high load. Whether my guess is correct or not, the approach I recommend should resolve the issue.

EDIT:

I realized that I didn't clarify one point in my proposed solution to your problem.

Should you use the same input in each of your 100 runs for the given input size? Or should use 100 different (random) inputs?

Since I recommended to take the minimum of the execution times, the inputs should be the same (otherwise you'll be getting incorrect output, as you'll measuring the best-case algorithm complexity instead of the average complexity!).

But when you take the same inputs, you create some noise in your chart since some inputs are simply faster than others.

So a better solution is to resolve the system load problem, without creating the problem of only one input per input size (this is obviously pseudocode):

seed = 'choose whatever you like'
repeats = 4
inputs_per_size = 25
runtimes = defaultdict(lambda : float('inf'))
for r in range(repeats):
  random.seed(seed)
  for i in range(inputs_per_size):
    for n in range(1000):
      input = generate_random_input(size = n)
      execution_time = get_execution_time(input)
      if runtimes[(n, i)] > execution_time:
        runtimes[(n,i)] = execution_time
for n in range(1000):
  runtimes[n] = sum(runtimes[(n,i)] for i in range(inputs_per_size))/inputs_per_size

Now you can use runtimes[n] to build your plot.

Of course, depending if your system is super-noisy, you might change (repeats, inputs_per_size) from (4,25) to say, (10,10), or even (25,4).

jfs · Answer

I can reproduce the spikes using your code:

spikes in measured time performance

You should choose an appropriate timing function (time.time() vs. time.clock() -- from timeit import default_timer), number of repetitions in a test (how long each test takes), and number of tests to choose the minimal time from. It gives you a better precision and less external influence on the results. Read the note from timeit.Timer.repeat() docs:

It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.

timeit module can choose appropriate parameters for you:

$ python -mtimeit -s 'from m import testdata, sort; a = testdata[:500]' 'sort(a)'

Here's timeit-based performance curve:

time peformance of sorting functions

The figure shows that sort() behavior is consistent with O(n*log(n)):

|------------------------------+-------------------|
| Fitting polynom              | Function          |
|------------------------------+-------------------|
| 1.00  log2(N)   +  1.25e-015 | N                 |
| 2.00  log2(N)   +  5.31e-018 | N*N               |
| 1.19  log2(N)   +      1.116 | N*log2(N)         |
| 1.37  log2(N)   +      2.232 | N*log2(N)*log2(N) |

To generate the figure I've used make-figures.py:

$ python make-figures.py --nsublists 1 --maxn=0x100000 -s vkazanov.msort -s vkazanov.msort_builtin

where:

# adapt sorting functions for make-figures.py
def msort(lists):
    assert len(lists) == 1
    return sort(lists[0]) # `sort()` from the question

def msort_builtin(lists):
    assert len(lists) == 1
    return sorted(lists[0]) # builtin

Input lists are described here (note: the input is sorted so builtin sorted() function shows expected O(N) performance).

Unexpected performance curve from CPython merge sort

Tags:

python

merge

sorting

garbage-collection

Vlad

2 Answers

max

jfs

Recent Activity

Donate For Us

Unexpected performance curve from CPython merge sort

Tags:

python

merge

sorting

garbage-collection

Vlad

2 Answers

max

jfs

Related questions

Recent Activity

Donate For Us