I am trying to find the best way to efficiently write large DataFrames (250MB+) to disk and read them back using Python/pandas. I've tried all of the methods in Python for Data Analysis, but the performance has been very disappointing.
This is part of a larger project exploring migrating our current analytic/data management environment from Stata to Python. When I compare the read/write times in my tests to those that I get with Stata, Python and Pandas are typically taking more than 20 times as long.
I strongly suspect that I am the problem, not Python or Pandas.
Any suggestions?
The practical size limit for a pandas DataFrame is the amount of memory available on your machine, not a set number of rows or cells.
The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
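As an illustrative sketch (the column names and sizes here are made up), converting a low-cardinality text column to the `category` dtype can cut its memory footprint dramatically:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one million rows, a text column with only three unique values.
n = 1_000_000
df = pd.DataFrame({
    "state": np.random.choice(["CA", "NY", "TX"], size=n),
    "value": np.random.randn(n),
})

print(df["state"].memory_usage(deep=True))    # object dtype: roughly tens of MB
df["state"] = df["state"].astype("category")  # stores integer codes plus a small lookup table
print(df["state"].memory_usage(deep=True))    # typically a few MB
```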
HDF5: This storage format is well suited to large amounts of heterogeneous data. Data is stored in an internal, hierarchical, file-system-like structure, which also makes it efficient to randomly access different parts of a dataset. For many data structures, the file size and access speed are much better than CSV.
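A rough sketch of that random-access pattern (the file name, key, and column names are arbitrary, and writing HDF5 from pandas requires the PyTables package). Using format="table" produces a queryable file, so you can read back only the rows you need:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": np.arange(1_000_000),
                   "value": np.random.randn(1_000_000)})

# format="table" writes a queryable PyTables table, so later reads can
# pull back just the rows of interest instead of the whole frame.
df.to_hdf("data.h5", key="df", mode="w", format="table", data_columns=["id"])

subset = pd.read_hdf("data.h5", "df", where="id > 990000")
```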
Using HDFStore is your best bet (it is not covered very much in the book and has changed quite a lot). You will find its performance is MUCH better than any other serialization method; a minimal usage sketch follows the references below.
Useful references:
- How to write/read various forms of HDF5
- Some recipes using HDF5
- Comparing performance of various writing/reading methods
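A minimal HDFStore sketch, assuming PyTables is installed; the file name, key, chunk size, and compression settings are illustrative rather than recommended values:

```python
import numpy as np
import pandas as pd

# Write a large frame in chunks to a compressed, queryable table store.
with pd.HDFStore("store.h5", mode="w", complib="blosc", complevel=9) as store:
    for _ in range(10):
        chunk = pd.DataFrame(np.random.randn(100_000, 4), columns=list("abcd"))
        store.append("df", chunk, data_columns=True)  # appends grow the same on-disk table

    # Read back only the rows matching a condition instead of the whole frame.
    subset = store.select("df", where="a > 0")
```

Appending in chunks keeps peak memory low, and data_columns=True lets those columns be filtered on in where clauses when selecting.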