 

Fastest way to parse large CSV files in Pandas

Tags:

python

pandas

I am using pandas to analyse large CSV data files. They are around 100 MB in size.

Each load from CSV takes a few seconds, and then more time still to convert the dates.

I have tried loading the files, converting the dates from strings to datetimes, and then re-saving them as pickle files. But loading those takes a few seconds as well.
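A minimal sketch of that workflow (the file and column names here are hypothetical, and a tiny CSV stands in for the real ~100 MB files):

```python
import pandas as pd

# Create a small stand-in CSV for the real ~100 MB file.
pd.DataFrame({"date": ["2014-08-01", "2014-08-02"], "value": [1.0, 2.0]}).to_csv(
    "data.csv", index=False
)

df = pd.read_csv("data.csv")             # step 1: parse the CSV text (slow)
df["date"] = pd.to_datetime(df["date"])  # step 2: convert strings to datetimes
df.to_pickle("data.pkl")                 # step 3: re-save in a binary format

df2 = pd.read_pickle("data.pkl")         # subsequent loads skip the CSV parsing
```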

What fast methods could I use to load/save the data from disk?

Ginger asked Aug 26 '14

1 Answer

As @chrisb said, pandas' read_csv is probably faster than csv.reader/numpy.genfromtxt/loadtxt. I don't think you will find anything better for parsing the CSV itself (as a note, read_csv is not a 'pure Python' solution: the CSV parser is implemented in C).
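One small win inside read_csv itself: pass parse_dates so the reader converts dates during the single parsing pass, rather than doing a separate pd.to_datetime conversion afterwards (the column name below is hypothetical):

```python
import io
import pandas as pd

csv_text = "date,value\n2014-08-01,1.0\n2014-08-02,2.0\n"

# parse_dates converts the named columns during the read,
# so no second pass over the data is needed.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
```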

But if you have to load/query the data often, a better solution is to parse the CSV only once and then store it in another format, e.g. HDF5. You can use pandas (with PyTables in the background) to query it efficiently (docs).
See here for a comparison of the I/O performance of HDF5, CSV and SQL with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
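A minimal sketch of the parse-once, store-as-HDF5 approach (requires the PyTables package; the data here is a hypothetical stand-in):

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["2014-08-01", "2014-08-02"]), "value": [1.0, 2.0]}
)

# Store once in HDF5; format="table" plus data_columns makes
# the columns queryable on disk.
df.to_hdf("data.h5", key="df", format="table", data_columns=True)

# Later loads read the binary store directly -- no CSV parsing.
df2 = pd.read_hdf("data.h5", "df")

# With format="table" you can also query without loading everything:
subset = pd.read_hdf("data.h5", "df", where="value > 1.5")
```

The where clause is what makes HDF5 attractive for repeated queries: only the matching rows are read from disk.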

And another possibly relevant question: "Large data" work flows using pandas

joris answered Oct 04 '22