 

Efficiently writing large Pandas data frames to disk

Tags:

python

pandas

I am trying to find the best way to efficiently move large data frames (250MB+) to and from disk using Python/Pandas. I've tried all of the methods in Python for Data Analysis, but the performance has been very disappointing.

This is part of a larger project exploring migrating our current analytic/data management environment from Stata to Python. When I compare the read/write times in my tests to those that I get with Stata, Python and Pandas are typically taking more than 20 times as long.

I strongly suspect that I am the problem, not Python or Pandas.

Any suggestions?

asked Oct 28 '13 by user2928791

People also ask

How big is too big for a Pandas DataFrame?

The limit for a pandas DataFrame is available memory (on the order of 100 gigabytes, GB) rather than a set number of cells.

Is pandas efficient for large data sets?

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
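As a minimal sketch (the frame, column names, and sizes below are made up for illustration), converting a low-cardinality text column to the category dtype can shrink memory use substantially:

import numpy as np
import pandas as pd

# Hypothetical frame: one low-cardinality text column repeated a million times
df = pd.DataFrame({
    "state": np.random.choice(["CA", "NY", "TX"], size=1_000_000),
    "value": np.random.randn(1_000_000),
})

print(df.memory_usage(deep=True))   # the object column dominates memory use

# 'category' stores integer codes plus a small lookup table of unique values
df["state"] = df["state"].astype("category")
print(df.memory_usage(deep=True))   # the 'state' column is now far smaller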

Is HDF5 better than CSV?

HDF5: This format of storage is best suited for storing large amounts of heterogeneous data. The data is kept in a hierarchical, filesystem-like internal structure. It is also useful for randomly accessing different parts of the data. For some data structures, the size and access speed are much better than CSV.
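To illustrate the random-access point (file name, key, and data below are hypothetical), writing with format='table' lets you pull a subset of rows back from disk without reading the whole file, which CSV cannot do:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": np.arange(1_000_000),
    "value": np.random.randn(1_000_000),
})

# format='table' makes the store queryable; data_columns allows filtering on 'id'
df.to_hdf("data.h5", key="df", format="table", data_columns=["id"], mode="w")

# Read back only the matching rows instead of loading the entire file
subset = pd.read_hdf("data.h5", key="df", where="id < 1000")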


1 Answer

Using HDFStore is your best bet (it isn't covered much in the book, and has changed quite a lot). You will find performance is MUCH better than any other serialization method; a short sketch follows the links below.

  • How to write/read various forms of HDF5

  • Some recipes using HDF5

  • Comparing performance of various writing/reading methods
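Here is a minimal sketch of the round trip through HDFStore, assuming a made-up frame of roughly the size in the question and a hypothetical file name; to_hdf/read_hdf are the convenience wrappers around the same store:

import numpy as np
import pandas as pd

# Hypothetical frame of roughly 250MB (4M rows x 8 float64 columns)
df = pd.DataFrame(np.random.randn(4_000_000, 8), columns=list("abcdefgh"))

# Write and read back with HDFStore; the default 'fixed' format is the fastest
with pd.HDFStore("frame.h5", mode="w") as store:
    store.put("df", df)

with pd.HDFStore("frame.h5", mode="r") as store:
    df2 = store["df"]

# The same round trip via the convenience functions
df.to_hdf("frame.h5", key="df", mode="w")
df2 = pd.read_hdf("frame.h5", key="df")

On frames of this size the HDF5 round trip is usually many times faster than text formats such as CSV, which is the performance gap referred to above.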

answered Oct 14 '22 by Jeff