 

HDF5 taking more space than CSV?

Consider the following example:

Prepare the data:

import string
import random
import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))
my_cols = [random.choice(string.ascii_uppercase) for x in range(matrix.shape[1])]
mydf = pd.DataFrame(matrix, columns=my_cols)
mydf['something'] = 'hello_world'

Set the highest compression possible for HDF5:

store = pd.HDFStore('myfile.h5', complevel=9, complib='bzip2')
store['mydf'] = mydf
store.close()

Also save it to CSV:

mydf.to_csv('myfile.csv', sep=':') 

The result is:

  • myfile.csv is 5.6 MB big
  • myfile.h5 is 11 MB big

The difference grows bigger as the datasets get larger.
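(For reference, a quick way to check the sizes from within Python; the exact numbers will vary from run to run since the data is random.)

import os

print(os.path.getsize('myfile.csv'))  # roughly 5.6 MB in this run
print(os.path.getsize('myfile.h5'))   # roughly 11 MB in this run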

I have tried with other compression methods and levels. Is this a bug? (I am using Pandas 0.11 and the latest stable version of HDF5 and Python).

Amelio Vazquez-Reina asked May 19 '13 21:05


People also ask

Why is HDF5 file so large?

This is probably due to your chunk layout: the smaller the chunks, the more your HDF5 file will be bloated by per-chunk overhead. Try to find a balance between chunk sizes that fit your access pattern and the size overhead they introduce in the HDF5 file.
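As an illustration (a minimal h5py sketch, not from the original page; the file and dataset names are made up), larger chunks usually mean less per-chunk bookkeeping and a better compression ratio:

import numpy as np
import h5py

data = np.random.random((100, 3000))

with h5py.File('chunk_demo.h5', 'w') as f:
    # Many tiny chunks: each chunk carries its own index entry and is compressed separately.
    f.create_dataset('tiny_chunks', data=data, chunks=(1, 10),
                     compression='gzip', compression_opts=9)
    # Fewer, larger chunks: less bookkeeping, typically a smaller file.
    f.create_dataset('big_chunks', data=data, chunks=(100, 1000),
                     compression='gzip', compression_opts=9)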

Is HDF5 better than csv?

HDF5 stores data in a binary format that is native to a computing platform but portable across platforms. This binary representation makes the format more efficient for computers than text formats (e.g., .txt or .csv), which are meant for humans to read.
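A quick way to see the difference (a small illustrative snippet, not from the original page): a float64 always occupies 8 bytes in binary, while its full-precision text form typically needs 17-20 characters plus a delimiter:

import struct

x = 0.123456789012345678
binary = struct.pack('<d', x)   # IEEE 754 double: always 8 bytes
text = repr(x)                  # shortest round-trip decimal text

print(len(binary))       # 8
print(len(text), text)   # ~19 characters, before adding any separator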

When should I use HDF5?

  • Supports large, complex data: HDF5 is a compressed format designed to support large, heterogeneous, and complex datasets.
  • Supports data slicing: "data slicing", or extracting portions of a dataset as needed for analysis, means large files don't have to be read completely into the computer's memory (RAM).
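For example (a self-contained h5py sketch, not from the original page; file and dataset names are made up), only the requested slice is pulled from disk:

import numpy as np
import h5py

# Write a largish 2-D dataset once...
with h5py.File('slice_demo.h5', 'w') as f:
    f.create_dataset('data', data=np.random.random((10000, 300)))

# ...then read back only the region needed for analysis.
with h5py.File('slice_demo.h5', 'r') as f:
    part = f['data'][100:110, :50]   # just this slice is loaded into memory

print(part.shape)  # (10, 50)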

How do I convert a csv file to HDF5?

import pandas as pd
from IPython.display import clear_output

CHUNK_SIZE = 5000000
filename = 'data.csv'
dtypes = {'latitude': float, 'longitude': float}
iter_csv = pd.read_csv(filename, iterator=True, dtype=dtypes,
                       encoding='utf-8', chunksize=CHUNK_SIZE)
cnt = 0
for ix, chunk in enumerate(iter_csv):
    # The original excerpt is truncated here; appending each chunk to an
    # HDF5 table is the assumed remainder of the recipe.
    chunk.to_hdf('data.h5', 'data', mode='a', format='table', append=True)
    cnt += len(chunk)
    clear_output(wait=True)
    print(cnt)


1 Answer

Copy of my answer from the issue: https://github.com/pydata/pandas/issues/3651

Your sample is really too small. HDF5 has a fair amount of overhead with really small sizes (even 300k entries is on the smaller side). The following is with no compression on either side. Floats are really more efficiently represented in binary (than as a text representation).

In addition, HDF5 is row based. You get MUCH better efficiency by having tables that are not too wide but are fairly long. (Hence your example is not very efficient in HDF5 at all; store it transposed in this case.)
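For illustration (a minimal sketch, not from the original answer; it reuses the matrix from the question and a made-up file name), storing the wide float block transposed gives HDF5 the long, narrow shape it prefers:

import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))

# 3000 rows x 100 columns instead of 100 x 3000
tall = pd.DataFrame(matrix.T)

store = pd.HDFStore('myfile_T.h5', complevel=9, complib='bzip2')
store['tall'] = tall
store.close()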

I routinely have tables that are 10M+ rows and query times can be in the ms. Even the example below is small. Having 10+ GB files is quite common (not to mention the astronomy guys, for whom 10 GB+ is a few seconds!)

-rw-rw-r--  1 jreback users 203200986 May 19 20:58 test.csv
-rw-rw-r--  1 jreback users  88007312 May 19 20:59 test.h5

In [1]: df = DataFrame(randn(1000000,10))

In [9]: df
Out[9]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0    1000000  non-null values
1    1000000  non-null values
2    1000000  non-null values
3    1000000  non-null values
4    1000000  non-null values
5    1000000  non-null values
6    1000000  non-null values
7    1000000  non-null values
8    1000000  non-null values
9    1000000  non-null values
dtypes: float64(10)

In [5]: %timeit df.to_csv('test.csv',mode='w')
1 loops, best of 3: 12.7 s per loop

In [6]: %timeit df.to_hdf('test.h5','df',mode='w')
1 loops, best of 3: 825 ms per loop

In [7]: %timeit pd.read_csv('test.csv',index_col=0)
1 loops, best of 3: 2.35 s per loop

In [8]: %timeit pd.read_hdf('test.h5','df')
10 loops, best of 3: 38 ms per loop

I really wouldn't worry about the size (I suspect you are not, but are merely interested, which is fine). The point of HDF5 is that disk is cheap, CPU is cheap, but you can't have everything in memory at once, so we optimize by using chunking.
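As a hedged illustration of that chunked, query-on-disk workflow (not from the original answer; the file and column names are made up, and format='table' is the modern spelling of the table option), a queryable store lets you read back only the rows a condition matches:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 3), columns=['a', 'b', 'c'])

# Write as a queryable 'table' store and make column 'a' searchable on disk.
df.to_hdf('query_demo.h5', 'df', mode='w', format='table', data_columns=['a'])

# Only the matching rows are read from disk into memory.
subset = pd.read_hdf('query_demo.h5', 'df', where='a > 2.5')
print(len(subset), 'rows loaded')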

Jeff answered Sep 24 '22 19:09