
Using compression with Pandas and HDF5 / HDFStore

For a few aspects of a project, using "h5" storage would be ideal. However, the files are becoming massive and frankly we're running out of space.

This statement...

 store.put(storekey, data, table=False, compression='gzip')

produces a file that is no smaller than the one from...

 store.put(storekey, data, table=False)

Is using compression even possible when going through Pandas?

If it isn't possible, I don't mind using h5py. However, I'm uncertain what to put for a "datatype", as the DataFrame contains all sorts of types (strings, floats, ints, etc.).

Any help/insight would be appreciated!

TravisVOX asked Aug 16 '13 13:08


3 Answers

See the pandas docs regarding compression with HDFStore.

gzip is not a valid compression option (and is silently ignored, which is a bug). Try any of zlib, bzip2, lzo, or blosc (bzip2 and lzo might need extra libraries installed).

See the PyTables docs for details on the various compression options.

Here's a semi-related question.
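
To check whether compression is actually taking effect, you can write the same frame with and without it and compare file sizes. This is only a sketch: `hdf_size`, the key name and the sample data are made up for illustration, and it assumes a pandas version where the old `table=False` is spelled `format="fixed"`. Note that compression is set on the store through `complib`/`complevel` rather than passed as `compression=...`.

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDFStore also needs the PyTables package ("tables")


def hdf_size(df, complib=None, complevel=None):
    """Write df to a temporary HDF5 file and return the file size in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.h5")
        # Compression is configured on the store itself.
        with pd.HDFStore(path, mode="w", complib=complib, complevel=complevel) as store:
            store.put("demo_key", df, format="fixed")
        return os.path.getsize(path)


# Hypothetical, highly repetitive data so the effect is easy to see.
df = pd.DataFrame({"a": np.zeros(200_000), "b": np.ones(200_000)})

plain = hdf_size(df)
zlib = hdf_size(df, complib="zlib", complevel=9)
print(f"uncompressed: {plain} bytes, zlib level 9: {zlib} bytes")
```

How much you save depends heavily on the data; random floats will shrink far less than the repetitive columns used here.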

Jeff answered Nov 15 '22 01:11


I've been quite a fan of HDF5 in the past, but having hit a variety of complications, especially with Pandas HDFStore, I'm starting to think Exdir is a good idea.

http://exdir.readthedocs.io

Quentin Stafford-Fraser answered Nov 15 '22 03:11


You can write your data in a compressed format like this:

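import pandas as pd

some_key = 'some_key'

# complib selects the compressor; complevel runs from 0 (none) to 9 (strongest)
with pd.HDFStore('path/to/your/h5/file.h5', complevel=9, complib='zlib') as store:
    # your_data_to_save_in_the_key is a placeholder for your own DataFrame
    store[some_key] = your_data_to_save_in_the_key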

And you can read it back:

with pd.HDFStore('path/to/your/h5/file.h5', complevel=9, complib='zlib') as store:
    data_retrieved = store[some_key]
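
The compression settings are recorded in the file, so reading does not need them repeated. Here is a minimal round-trip sketch using `DataFrame.to_hdf` and `pd.read_hdf` instead of an explicit store; the sample frame, file name and key are placeholders, not anything from the question.

```python
import os
import tempfile

import numpy as np
import pandas as pd  # requires the PyTables package ("tables") for HDF5 support

# Hypothetical sample data standing in for your own DataFrame.
df = pd.DataFrame({"x": np.arange(1000), "y": np.linspace(0.0, 1.0, 1000)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.h5")

    # Write with compression set per call instead of on the store.
    df.to_hdf(path, key="some_key", mode="w", complevel=9, complib="zlib")

    # Read back without any compression arguments.
    restored = pd.read_hdf(path, key="some_key")

# Report whether the data survived the compressed round trip unchanged.
print(restored.equals(df))
```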

Alexander Martins answered Nov 15 '22 03:11