I'm reading a large csv file (more than 4 million rows) using the invaluable csv module in Python. In timing various approaches, I've come across an unintuitive result.

If I run the following script, it takes about 11-12 seconds; b is created almost instantly after a:
r = csv.reader(open("data.csv", "rb"), delimiter=";")
a = [None for row in r]
b = [row for row in r]
But if I run a similar script that doesn't create a at all, the code takes longer (21-22 seconds):
r = csv.reader(open("data.csv", "rb"), delimiter=";")
b = [row for row in r]
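A minimal sketch of how the two variants can be timed side by side (the time.time() harness here is illustrative, not the exact script behind the numbers above):

import csv
import time

# Variant 1: build a throwaway list a, then build b
start = time.time()
r = csv.reader(open("data.csv", "rb"), delimiter=";")
a = [None for row in r]
b = [row for row in r]
print "a then b:", time.time() - start, "seconds"

# Variant 2: build b alone
start = time.time()
r = csv.reader(open("data.csv", "rb"), delimiter=";")
b = [row for row in r]
print "b alone:", time.time() - start, "seconds"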
I can comprehend why the creation of b takes almost no time after a has already been created. But I would have thought (naively) that the second code block, in which only b is created and not a, would be the faster script. At the risk of appearing un-Pythonic, I'm interested to know if anyone can explain why creating a and then b is almost twice as fast as creating b alone.

Furthermore, if this speed boost is consistent across more complicated operations, are there good reasons (other than style/readability concerns) not to take advantage of it? Are savvier Python programmers than I already achieving the same time savings with some conventional method I've never heard of?
Some notes:
- If I construct a using, say, an integer instead of None, I get the same result.
- If instead of iterating over a csv.reader object I iterate over open("data.csv", "rb").readlines(), the timing is as I expect it to be: creating b alone is faster than creating a then b (see the sketch after these notes). So the time disparity presumably has something to do with the properties of a csv.reader object, or of a more general class of objects like it.
- If I create b prior to a, the time is about the same as if I created b alone.
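For the readlines() comparison in the second note, the two variants would look roughly like this (my paraphrase of the setup, not code from the original post):

lines = open("data.csv", "rb").readlines()

# Unlike a csv.reader object, a plain list can be iterated over any number of times
a = [None for line in lines]
b = [line for line in lines]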
Have you looked at b in your first example? It's empty, because r was exhausted by the first list comprehension. All rows have already been iterated over, and - as @soulcheck has pointed out - it's much faster to create a list of 4 million Nones than a list that contains 4 million sublists.
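A quick way to confirm this, using the same data.csv (a minimal check, not part of the original answer):

import csv

r = csv.reader(open("data.csv", "rb"), delimiter=";")
a = [None for row in r]   # consumes every row the reader will ever yield
b = [row for row in r]    # nothing left to iterate over

print len(a)   # number of rows in the file
print len(b)   # 0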
This might give some insight. Let's take a short example of a csv file with 10 lines and compare this:
import csv
from collections import Counter

# First pass: keep only the id() of each row object, not the row itself
r = csv.reader(open('foo.csv'))
a = [id(row) for row in r]

# Second pass: keep the rows themselves, then record their ids
r = csv.reader(open('foo.csv'))
b = [row for row in r]
b_id = [id(row) for row in b]

# Count how many times each id shows up
c1 = Counter(a)
c2 = Counter(b_id)

print c1
print c2
This results in
Counter({139713821424456: 5, 139713821196512: 5})
Counter({139713821196512: 1, 139713821669136: 1, 139713821668776: 1, 139713821196584: 1, 139713821669064: 1, 139713821668560: 1, 139713821658792: 1, 139713821668704: 1, 139713821668848: 1, 139713821668632: 1})
In other words, in a, we reused the same memory over and over. Since the list comprehension for a does not retain any reference to the row, it is garbage collected right away, opening that memory up for reuse. If we hold onto each row, naturally, we have to allocate memory for every new list.
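The same pattern can be reproduced without csv at all, since it only depends on CPython freeing each unreferenced row immediately and handing its memory to the next one. A small illustrative sketch (the make_rows generator here is my own stand-in for csv.reader, not from the original answer):

from collections import Counter

def make_rows(n):
    # yields a fresh list on every iteration, much like csv.reader does
    for i in range(n):
        yield [i, i * 2]

# Discard each row as we go: freed memory is typically reused, so only a
# couple of distinct ids ever appear
a = [id(row) for row in make_rows(10)]

# Keep every row alive: each list needs its own memory, so the ids are distinct
b = [row for row in make_rows(10)]
b_id = [id(row) for row in b]

print Counter(a)
print Counter(b_id)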