I have been working on a python script to analyze CSVs. Some of these files are fairly large (1-2 million records), and the script was taking hours to complete.
I changed the way the records are processed from a for-in
loop to a while
loop, and the speedup was remarkable. Demonstration below:
>>> def for_list():
... for d in data:
... bunk = d**d
...
>>> def while_list():
... while data:
... d = data.pop(0)
... bunk = d**d
...
>>> data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> import timeit
>>> timeit.timeit(for_list)
1.0698931217193604
>>> timeit.timeit(while_list)
0.14515399932861328
Almost an order of magnitude faster. I've never looked at python bytecode, but I though it might be telling, but it turns out while_list
has more instructions.
So what's going on here? Is there a principle here I can apply to other programs? Are there scenarios where the for
would be ten times faster than the while
?
EDIT: As @HappyLeapSecond pointed out, I didn't quite understand exactly what was going on inside timeit
The discrepancy is gone with the following:
>>> def for_list():
... data = [x for x in range(1000)]
... for d in data:
... bunk = d**d
...
>>> def while_list():
... data = [x for x in range(1000)]
... while data:
... d = data.pop(0)
... bunk = d**d
>>> timeit.timeit(while_list, number=1000)
12.006330966949463
>>> timeit.timeit(for_list, number=1000)
11.847280025482178
Which makes it very odd that my "real" script sped up so much with such a simple change. My best guess is that the iteration method is requiring more swapping? I have a 40G swap partition, the script fills about 15-20G of it. Would popping reduce swapping?
while_list
is mutating the global data
. timeit.timeit
does not reset the value of data
. timeit.timeit
calls for_list
and while_list
a million times each by default. After the first call to while_list
, subsequent calls to while_list
return after performing 0 loops because data
is already empty.
You need to reset the value of data
before each call to for_list
and while_list
to perform a fair benchmark.
import timeit
def for_list(data):
for d in data:
bunk = d ** d
def while_list(data):
while data:
d = data.pop(0)
bunk = d ** d
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(timeit.timeit('data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; for_list(data)', 'from __main__ import for_list'))
# 0.959696054459
print(timeit.timeit('data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; while_list(data)', 'from __main__ import while_list'))
# 2.40107011795
pop(0)
is a O(n)
operation. Performing that inside a loop of length n
makes while_list
have an overall time complexity O(n**2)
, compared to O(n)
complexity for for_list
. So as expected, for_list
is faster, and the advantage grows as n
, the length of data
, gets larger.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With