With python csv module, why does creating a list of identical values speed up creation of a list of rows?

Tags: python, csv

I'm reading a large csv file (more than 4 million rows) using the invaluable csv module in Python. In timing various approaches, I've come across an unintuitive result.

If I run the following script it takes about 11-12 seconds. b is created almost instantly after a.

import csv

r = csv.reader(open("data.csv", "rb"), delimiter=";")
a = [None for row in r]
b = [row for row in r]

But if I run a similar script that doesn't create a at all, the code takes longer (21-22 seconds):

import csv

r = csv.reader(open("data.csv", "rb"), delimiter=";")
b = [row for row in r]

I can comprehend why the creation of b takes almost no time after a has already been created. But I would have thought (naively) that the second code block, in which only b is created and not a, would be the faster script. At the risk of appearing un-Pythonic, I'm interested to know if anyone can explain why creating a and then b is almost twice as fast as creating b alone.

Furthermore, if this speed boost is consistent across more complicated operations, are there good reasons (other than style/readability concerns) not to take advantage of it? Are savvier Python programmers than I already achieving the same time savings with some conventional method I've never heard of?

If I construct a using, say, an integer instead of None, I get the same result. If instead of iterating over a csv.reader object I iterate over open("data.csv", "rb").readlines(), the timing is as I expect: creating b alone is faster than creating a and then b. So the time disparity presumably has something to do with the properties of a csv.reader object, or of a more general class of objects like it.
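For concreteness, the comparison can be reproduced with a harness along these lines (a sketch, not my exact timing code; it assumes the same data.csv as above, with each variant run as its own script):

import csv
import time

start = time.time()
r = csv.reader(open("data.csv", "rb"), delimiter=";")
a = [None for row in r]  # remove this line to time the b-only variant
b = [row for row in r]
print "elapsed: %.1f s" % (time.time() - start)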

Some notes:

  • Creating b prior to a takes the same time as creating b alone.
  • I'm not running these line-by-line in interactive mode. I'm running each as a separate script.
  • I'm not really trying to create a list full of Nones with the same length as r, or a list of the rows in r.
  • In case it matters, I'm running Python 2.7.3, using the Enthought Python distribution 7.3-2, on 64-bit Windows 7.
asked Jan 13 '23 by ASGM


2 Answers

Have you looked at b in your first example? It's empty, because r was exhausted by the first list comprehension: all the rows have already been iterated over. And, as @soulcheck has pointed out, it's much faster to create a list of 4 million Nones than a list that contains 4 million sublists.
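A quick check makes the exhaustion visible (a minimal sketch, assuming the same semicolon-delimited data.csv as in the question):

import csv

r = csv.reader(open("data.csv", "rb"), delimiter=";")
a = [None for row in r]  # consumes every row the reader will ever yield
b = [row for row in r]   # the reader is already exhausted, so this is empty

print len(a)  # the number of rows in the file
print len(b)  # 0

So the first script isn't building the row list twice as fast; it builds the list of Nones once and then times an empty comprehension.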

answered Jan 30 '23 by Tim Pietzcker


This might give some insight. Let's take a short csv file with 10 lines as an example and compare the two approaches:

import csv
from collections import Counter

# First pass: don't keep any row, just record each row's id().
r = csv.reader(open('foo.csv'))
a = [id(row) for row in r]

# Second pass: keep the rows alive in b, then record their ids.
r = csv.reader(open('foo.csv'))
b = [row for row in r]
b_id = [id(row) for row in b]

c1 = Counter(a)     # how many times each id appeared
c2 = Counter(b_id)

print c1
print c2

This results in

Counter({139713821424456: 5, 139713821196512: 5})
Counter({139713821196512: 1, 139713821669136: 1, 139713821668776: 1, 139713821196584: 1, 139713821669064: 1, 139713821668560: 1, 139713821658792: 1, 139713821668704: 1, 139713821668848: 1, 139713821668632: 1})

In other words, in a we reused the same memory over and over. Since the list comprehension for a does not retain any reference to the row, each row is garbage collected right away, opening that memory up for reuse; in CPython the row's reference count hits zero as soon as the loop variable is rebound to the next row, which is why just two addresses alternate in the first Counter. If we hold onto the rows, as in b, we naturally have to allocate memory for each new list.
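The same reuse shows up without csv at all; a minimal sketch (the behavior is CPython-specific, and the exact ids will vary from run to run):

# The throwaway list [i] is freed as soon as id() returns, so CPython
# typically hands back the same memory block on the next iteration and
# one id shows up repeatedly.
print [id([i]) for i in range(5)]

# Keeping every list alive forces a separate allocation for each one,
# so the ids are all distinct.
kept = [[i] for i in range(5)]
print [id(x) for x in kept]

The rows from the csv reader behave the same way: discard them and the same buffers get recycled; keep them, as in b, and every row needs its own allocation.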

answered Jan 30 '23 by FatalError