Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Why is popping off a queue faster than for-in block?

I have been working on a python script to analyze CSVs. Some of these files are fairly large (1-2 million records), and the script was taking hours to complete.

I changed the way the records are processed from a for-in loop to a while loop, and the speedup was remarkable. Demonstration below:

>>> def for_list():
...     for d in data:
...             bunk = d**d
... 
>>> def while_list():
...     while data:
...             d = data.pop(0)
...             bunk = d**d
... 
>>> data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> import timeit
>>> timeit.timeit(for_list)
1.0698931217193604
>>> timeit.timeit(while_list)
0.14515399932861328

Almost an order of magnitude faster. I've never looked at python bytecode, but I though it might be telling, but it turns out while_list has more instructions.

So what's going on here? Is there a principle here I can apply to other programs? Are there scenarios where the for would be ten times faster than the while?

EDIT: As @HappyLeapSecond pointed out, I didn't quite understand exactly what was going on inside timeit The discrepancy is gone with the following:

>>> def for_list():
...     data = [x for x in range(1000)]
...     for d in data:
...             bunk = d**d
... 
>>> def while_list():
...     data = [x for x in range(1000)]
...     while data:
...             d = data.pop(0)
...             bunk = d**d
>>> timeit.timeit(while_list, number=1000)
12.006330966949463
>>> timeit.timeit(for_list, number=1000)
11.847280025482178

Which makes it very odd that my "real" script sped up so much with such a simple change. My best guess is that the iteration method is requiring more swapping? I have a 40G swap partition, the script fills about 15-20G of it. Would popping reduce swapping?

like image 981
Will Avatar asked Jul 14 '15 18:07

Will


1 Answers

while_list is mutating the global data. timeit.timeit does not reset the value of data. timeit.timeit calls for_list and while_list a million times each by default. After the first call to while_list, subsequent calls to while_list return after performing 0 loops because data is already empty.

You need to reset the value of data before each call to for_list and while_list to perform a fair benchmark.


import timeit

def for_list(data):
    for d in data:
        bunk = d ** d


def while_list(data):
    while data:
        d = data.pop(0)
        bunk = d ** d

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(timeit.timeit('data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; for_list(data)', 'from __main__ import for_list'))
# 0.959696054459

print(timeit.timeit('data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; while_list(data)', 'from __main__ import while_list'))
# 2.40107011795

pop(0) is a O(n) operation. Performing that inside a loop of length n makes while_list have an overall time complexity O(n**2), compared to O(n) complexity for for_list. So as expected, for_list is faster, and the advantage grows as n, the length of data, gets larger.

like image 90
unutbu Avatar answered Oct 03 '22 09:10

unutbu