Python: fastest way to retrieve comma-separated data in a file

Tags:

python

I have a file of a couple hundred thousand lines which looks like this:

01,T,None,Red,Big
02,F,None,Purple,Small
03,T,None,Blue,Big
.......

I want something that will retrieve the nth column from the whole file. For example, the 4th column would be:

Red
Purple
Blue

Since the file is very big, I am interested in knowing the most efficient way to do this.

The obvious solution would be to go through the file line by line, then apply split(',') and take the 4th item of the resulting list, but I am wondering if there is anything slightly better.

asked Oct 18 '13 by cgf

1 Answer

I don't think you can improve on simply reading the file line by line and using str.split(). However, you haven't shown us all your code... you might want to make sure you aren't reading the entire file into memory before working on it (e.g. with file.readlines() or file.read()).

Something like this is probably about as good as you can do:

with open(filename, "rt") as f:
    for line in f:
        x = line.split(',')[3]  # index 3 is the 4th column (0-based)
        # do something with x

If you want to be able to treat an input file as if it contained only one column, I suggest wrapping the above in a function that uses yield to provide the values.

def get_col3(f):
    # yield column index 3 (the 4th column) of each line
    for line in f:
        yield line.split(',')[3]

with open(filename, "rt") as f:
    for x in get_col3(f):
        pass  # do something with x

Given that the file I/O stuff is part of the C guts of Python, you probably can't pick up too much extra speed by being tricky. But you could try writing a simple C program that reads a file, finds the fourth column, and prints it to standard output, then pipe that into a Python program.
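
If you go that route, the Python side of the pipeline just reads column values from standard input. A minimal sketch (the helper could just as easily be the standard cut utility as a custom C program; the script name below is made up):

import sys

# assumes an upstream process (a C helper, or: cut -d, -f4 data.txt)
# writes one column value per line to our standard input
for line in sys.stdin:
    x = line.rstrip('\n')
    # do something with x

which you would run as, e.g., cut -d, -f4 data.txt | python consume.py.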

If you will be working with the same input file a lot, it would probably make sense to save it in some sort of binary file format that is faster than parsing a text file. I believe the scientific computing folks who work with really large data sets favor HDF5, and Python has good support for it through Pandas.

http://pandas.pydata.org/

http://www.hdfgroup.org/HDF5/
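
As a rough sketch of that workflow (the file names, column names, and HDF5 key here are my own assumptions, and to_hdf() also requires the PyTables package):

import pandas as pd

# one-time conversion: parse the text file once and store it in binary form
df = pd.read_csv("data.txt", header=None,
                 names=["id", "flag", "none", "color", "size"])
df.to_hdf("data.h5", key="table", mode="w")

# later runs: load the binary file instead of re-parsing the text
colors = pd.read_hdf("data.h5", "table")["color"]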

Hmm, now that I think about it: you should try using Pandas to import that text file. I remember the author of Pandas saying he had written some low-level code that greatly accelerated parsing input files.

Oh, found it: http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/

Hmm. Looking in the Pandas documentation, it appears you can use read_csv() with an optional argument usecols to specify a subset of columns you want, and it will throw away everything else.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
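
Something along these lines should do it (untested sketch; the file name is an assumption):

import pandas as pd

# parse only column index 3 (the 4th column); pandas discards the rest
col4 = pd.read_csv("data.txt", header=None, usecols=[3])[3]
for x in col4:
    pass  # do something with x

With header=None the columns keep their integer positions as labels, which is why the single remaining column is still addressed as [3].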

The reason I think Pandas might win on speed: when you call line.split(','), Python builds a string object for each of the columns, plus a list to hold them. Then you index the list to grab the one string you need, and Python destroys the list and the objects it created (other than the column you wanted). This "churn" in Python's object pool takes some time, and you multiply that time by the number of lines in the file. Pandas can parse the lines in low-level code and hand back to Python only the column you need, and it might therefore win.

But all this is mere speculation. The rule for speeding things up is: measure. Run one version and measure how fast it is, then run the other and measure, and see whether the speedup is worth it.
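
For instance, a quick way to time the plain-split approach with the standard timeit module (the sample data here is fabricated):

import timeit

# time splitting 1000 sample lines, repeated 1000 times
setup = "lines = ['01,T,None,Red,Big'] * 1000"
stmt = "[line.split(',')[3] for line in lines]"
print(timeit.timeit(stmt, setup=setup, number=1000))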

answered Oct 05 '22 by steveha