I have been working on a python script to analyze CSVs. Some of these files are fairly large (1-2 million records), and the script was taking hours to complete.
I changed the way the records are processed from a for-in loop to a while loop, and the speedup was remarkable.  Demonstration below:
>>> def for_list():
...     for d in data:
...             bunk = d**d
... 
>>> def while_list():
...     while data:
...             d = data.pop(0)
...             bunk = d**d
... 
>>> data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> import timeit
>>> timeit.timeit(for_list)
1.0698931217193604
>>> timeit.timeit(while_list)
0.14515399932861328
Almost an order of magnitude faster.  I've never looked at python bytecode, but I though it might be telling, but it turns out while_list has more instructions.
So what's going on here?  Is there a principle here I can apply to other programs?  Are there scenarios where the for would be ten times faster than the while?
EDIT: As @HappyLeapSecond pointed out, I didn't quite understand exactly what was going on inside timeit  The discrepancy is gone with the following:
>>> def for_list():
...     data = [x for x in range(1000)]
...     for d in data:
...             bunk = d**d
... 
>>> def while_list():
...     data = [x for x in range(1000)]
...     while data:
...             d = data.pop(0)
...             bunk = d**d
>>> timeit.timeit(while_list, number=1000)
12.006330966949463
>>> timeit.timeit(for_list, number=1000)
11.847280025482178
Which makes it very odd that my "real" script sped up so much with such a simple change. My best guess is that the iteration method is requiring more swapping? I have a 40G swap partition, the script fills about 15-20G of it. Would popping reduce swapping?
while_list is mutating the global data. timeit.timeit does not reset the value of data. timeit.timeit calls for_list and  while_list a million times each by default. After the first call to while_list, subsequent calls to while_list return after performing 0 loops because data is already empty.
You need to reset the value of data before each call to for_list and while_list to perform a fair benchmark.
import timeit
def for_list(data):
    for d in data:
        bunk = d ** d
def while_list(data):
    while data:
        d = data.pop(0)
        bunk = d ** d
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(timeit.timeit('data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; for_list(data)', 'from __main__ import for_list'))
# 0.959696054459
print(timeit.timeit('data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; while_list(data)', 'from __main__ import while_list'))
# 2.40107011795
pop(0) is a O(n) operation. Performing that inside a loop of length n makes while_list have an overall time complexity O(n**2), compared to O(n) complexity for for_list. So as expected, for_list is faster, and the advantage grows as n, the length of data, gets larger.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With