Pandas, large data, HDF tables and memory usage when calling a function

Short question

When Pandas operates on an HDFStore (e.g. .mean() or .apply()), does it load the full data into memory as a DataFrame, or does it process it record by record as a Series?

Long description

I have to work with large data files, and I can specify the output format of those files.

I intend to use Pandas to process the data, and I would like to set up the best format so that it maximizes performance.

I have seen that pandas.read_table() has come a long way, but it still takes at least as much memory (in fact, at least twice as much) as the size of the original file we want to read and turn into a DataFrame. This may work for files up to 1 GB, but beyond that? It could get difficult, especially on shared online machines.

However, I have seen that Pandas now seems to support HDF tables via PyTables.

My question is: how does Pandas manage memory when we run an operation on a whole HDF table, for example a .mean() or .apply()? Does it first load the entire table into a DataFrame, or does it apply the function to the data directly from the HDF file without storing it in memory?

Side question: is the HDF5 format compact in terms of disk usage? I mean, is it verbose like XML or more like JSON? (I know there are indexes and such, but here I am interested in the bare description of the data.)

asked Mar 28 '13 by gaborous
1 Answer

I think I have found the answer: yes and no, it depends on how you load your Pandas DataFrame.

As with the read_table() method, there is an "iterator" argument (and a related "chunksize" argument) that lets you get a generator object which fetches only one record, or one chunk of records, at a time, as explained here: http://pandas.pydata.org/pandas-docs/dev/io.html#iterator
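
To illustrate, here is a minimal sketch (the file names, the HDF key 'mydata' and the chunk size are placeholders I made up); both read_table() and HDFStore.select() can yield chunks instead of one big DataFrame:

    import pandas as pd

    # Sketch: iterate over a flat file in chunks instead of loading it
    # all at once; each chunk is a regular DataFrame.
    for chunk in pd.read_table('data.txt', chunksize=100000):
        print(len(chunk))  # work on up to 100,000 rows at a time

    # The same idea for an HDF5 table (it must be stored in 'table' format):
    with pd.HDFStore('data.h5') as store:
        for chunk in store.select('mydata', chunksize=100000):
            print(len(chunk))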

Now, I don't know how functions like .mean() and .apply() would work with these generators.
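
That said, a reduction like a mean can at least be computed manually, chunk by chunk; here is a minimal sketch (the file name 'data.h5', the key 'mydata' and the column 'value' are made-up placeholders):

    import pandas as pd

    # Sketch: chunk-wise mean of one column, so the full table never
    # has to fit in memory at once. All names are placeholders.
    total = 0.0
    count = 0
    with pd.HDFStore('data.h5') as store:
        for chunk in store.select('mydata', chunksize=100000):
            total += chunk['value'].sum()
            count += len(chunk)
    print(total / count)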

If someone has more info/experience, feel free to share!

About HDF5 overhead:

HDF5 keeps a B-tree in memory that is used to map chunk structures on disk. The more chunks that are allocated for a dataset the larger the B-tree. Large B-trees take memory and cause file storage overhead as well as more disk I/O and higher contention for the metadata cache. Consequently, it’s important to balance between memory and I/O overhead (small B-trees) and time to access data (big B-trees).

http://pytables.github.com/usersguide/optimization.html
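
For what it's worth, HDF5 is a binary format, so on-disk size depends mainly on chunking and compression settings rather than on any textual verbosity. A hedged sketch of writing a compressed table with pandas (the file name and key are placeholders):

    import numpy as np
    import pandas as pd

    # Sketch: write a DataFrame as a compressed HDF5 table.
    # 'data.h5' and 'mydata' are placeholder names.
    df = pd.DataFrame({'value': np.random.randn(1000000)})
    df.to_hdf('data.h5', key='mydata', format='table',
              complib='blosc', complevel=9)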

answered Sep 26 '22 by gaborous