I am using pandas to analyse large CSV data files. They are around 100 megs in size.
Each load from csv takes a few seconds, and then more time to convert the dates.
I have tried loading the files, converting the dates from strings to datetimes, and then re-saving them as pickle files. But loading those takes a few seconds as well.
What fast methods could I use to load/save the data from disk?
As @chrisb said, pandas' read_csv
is probably faster than csv.reader/numpy.genfromtxt/loadtxt
. I don't think you will find something better to parse the csv (as a note, read_csv
is not a 'pure python' solution, as the CSV parser is implemented in C).
But, if you have to load/query the data often, a solution would be to parse the CSV only once and then store it in another format, eg HDF5. You can use pandas
(with PyTables
in background) to query that efficiently (docs).
See here for a comparison of the io performance of HDF5, csv and SQL with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
And a possibly relevant other question: "Large data" work flows using pandas
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With