 

Fastest way to parse large CSV files in Pandas

Tags:

python

pandas

I am using pandas to analyse large CSV data files. They are around 100 MB in size.

Each load from CSV takes a few seconds, and then more time still to convert the dates.

I have tried loading the files, converting the dates from strings to datetimes, and then re-saving them as pickle files. But loading those takes a few seconds as well.
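A minimal sketch of that workflow (the file and column names here are hypothetical, and a tiny CSV stands in for the real ~100 MB files):

```python
import pandas as pd

# Create a small stand-in CSV for the real ~100 MB file.
pd.DataFrame({"date": ["2014-08-01", "2014-08-02"], "value": [1.0, 2.0]}).to_csv(
    "data.csv", index=False
)

df = pd.read_csv("data.csv")             # step 1: parse the CSV text (slow)
df["date"] = pd.to_datetime(df["date"])  # step 2: convert strings to datetimes
df.to_pickle("data.pkl")                 # step 3: re-save in a binary format

df2 = pd.read_pickle("data.pkl")         # subsequent loads skip the CSV parsing
```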

What fast methods could I use to load/save the data from disk?

Ginger asked Aug 26 '14

1 Answer

As @chrisb said, pandas' read_csv is probably faster than csv.reader/numpy.genfromtxt/loadtxt. I don't think you will find anything better for parsing the CSV itself (as a note, read_csv is not a 'pure Python' solution: the CSV parser is implemented in C).
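One small win inside read_csv itself: pass parse_dates so the reader converts dates during the single parsing pass, rather than doing a separate pd.to_datetime conversion afterwards (the column name below is hypothetical):

```python
import io
import pandas as pd

csv_text = "date,value\n2014-08-01,1.0\n2014-08-02,2.0\n"

# parse_dates converts the named columns during the read,
# so no second pass over the data is needed.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
```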

But if you have to load/query the data often, a better solution is to parse the CSV only once and then store it in another format, e.g. HDF5. You can use pandas (with PyTables in the background) to query it efficiently (docs).
See here for a comparison of the I/O performance of HDF5, CSV and SQL with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
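A minimal sketch of the parse-once, store-as-HDF5 approach (requires the PyTables package; the data here is a hypothetical stand-in):

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["2014-08-01", "2014-08-02"]), "value": [1.0, 2.0]}
)

# Store once in HDF5; format="table" plus data_columns makes
# the columns queryable on disk.
df.to_hdf("data.h5", key="df", format="table", data_columns=True)

# Later loads read the binary store directly -- no CSV parsing.
df2 = pd.read_hdf("data.h5", "df")

# With format="table" you can also query without loading everything:
subset = pd.read_hdf("data.h5", "df", where="value > 1.5")
```

The where clause is what makes HDF5 attractive for repeated queries: only the matching rows are read from disk.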

And another possibly relevant question: "Large data" work flows using pandas

joris answered Oct 04 '22