Consider the following example:
import string
import random
import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))
my_cols = [random.choice(string.ascii_uppercase) for x in range(matrix.shape[1])]
mydf = pd.DataFrame(matrix, columns=my_cols)
mydf['something'] = 'hello_world'
store = pd.HDFStore('myfile.h5', complevel=9, complib='bzip2')
store['mydf'] = mydf
store.close()
mydf.to_csv('myfile.csv', sep=':')
The result is:

myfile.csv is 5.6 MB
myfile.h5 is 11 MB

The difference grows bigger as the datasets get larger.
I have tried other compression methods and levels. Is this a bug? (I am using pandas 0.11 and the latest stable versions of HDF5 and Python.)
This is probably due to your chunk layout: the smaller the chunks, the more your HDF5 file is bloated by per-chunk overhead. Try to find a balance between chunk sizes small enough to suit your access pattern and the size overhead they introduce in the HDF5 file.
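A rough back-of-the-envelope sketch of why small chunks bloat the file: each chunk carries some fixed metadata cost, so halving the chunk size roughly doubles that cost. The 512-byte per-chunk figure below is an illustrative assumption, not a number from the HDF5 specification.

```python
import math

def estimated_overhead(n_elements, chunk_size, bytes_per_chunk=512):
    """Rough estimate: more (smaller) chunks -> more chunk metadata.

    bytes_per_chunk is an assumed, illustrative per-chunk overhead.
    """
    n_chunks = math.ceil(n_elements / chunk_size)
    return n_chunks * bytes_per_chunk

# 1M float64 elements is ~8 MB of raw data.
small_chunks = estimated_overhead(1_000_000, 100)      # 10,000 chunks
large_chunks = estimated_overhead(1_000_000, 100_000)  # 10 chunks
print(small_chunks)  # 5120000 -- overhead comparable to the data itself
print(large_chunks)  # 5120    -- negligible
```

The exact constants vary by library and settings, but the shape of the trade-off is the same: chunks sized for your typical read pattern, not smaller.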
HDF5 stores data in a binary format that is native to a computing platform but portable across platforms. This native binary layout makes it more efficient for computers to read and write than text formats (e.g., .txt or .csv), which are meant for humans to read.
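A quick way to see the binary-vs-text difference, using only the standard library: a float64 always occupies 8 bytes in binary, while its decimal text form is usually longer (and needs a delimiter on top in a CSV).

```python
import struct

value = 0.123456789012345

binary = struct.pack('<d', value)   # little-endian IEEE 754 double: always 8 bytes
text = repr(value).encode('ascii')  # how the value would appear in a CSV cell

print(len(binary))  # 8
print(len(text))    # noticeably more than 8 for a full-precision float
```

Binary storage also skips the parse/format step entirely, which is a big part of the read/write speed gap shown further down.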
Supports Large, Complex Data: HDF5 is a compressed format designed to support large, heterogeneous, and complex datasets. Supports Data Slicing: "data slicing", or extracting portions of the dataset as needed for analysis, means large files don't need to be read completely into the computer's memory (RAM).
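The slicing idea can be illustrated with a plain binary file and `numpy.memmap`: only the rows you index are pulled from disk, so the whole file never has to fit in RAM. This is a simplified sketch of the principle; HDF5 readers such as h5py or PyTables do the same thing with richer metadata (names, dtypes, chunking, compression). The file path here is a throwaway temp file.

```python
import os
import tempfile
import numpy as np

# Write ~8 MB of float64 data as a raw binary file (illustrative stand-in
# for a large on-disk dataset).
path = os.path.join(tempfile.mkdtemp(), 'big.bin')
np.arange(1_000_000, dtype=np.float64).tofile(path)

# Memory-map the file: nothing is loaded yet.
mm = np.memmap(path, dtype=np.float64, mode='r')

# Only this 10-element window is actually read from disk.
window = np.array(mm[500:510])
print(window[0])  # 500.0
```

With an HDF5 table the equivalent would be a `where=` query or `start`/`stop` bounds on the read, rather than raw offsets.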
from IPython.display import clear_output
import pandas as pd

CHUNK_SIZE = 5000000
filename = 'data.csv'
dtypes = {'latitude': float, 'longitude': float}
iter_csv = pd.read_csv(filename, iterator=True, dtype=dtypes,
                       encoding='utf-8', chunksize=CHUNK_SIZE)
cnt = 0
for ix, chunk in enumerate(iter_csv):
    # process each chunk here (the original snippet is truncated at this point)
    ...
Copy of my answer from the issue: https://github.com/pydata/pandas/issues/3651
Your sample is really too small. HDF5 has a fair amount of overhead at really small sizes (even 300k entries is on the smaller side). The following is with no compression on either side. Floats are much more efficiently represented in binary than as a text representation.
In addition, HDF5 is row based. You get MUCH better efficiency from tables that are not too wide but fairly long. (Hence your example is not very efficient in HDF5 at all; store it transposed in this case.)
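A minimal sketch of what "store it transposed" means for the question's numeric data: the same values laid out as a long, narrow frame instead of a short, wide one. (The string column from the question is left out here, since transposing a mixed-dtype frame would force everything to object dtype; file names in the commented write are illustrative.)

```python
import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))

df_wide = pd.DataFrame(matrix)    # 100 rows x 3000 columns (as in the question)
df_long = pd.DataFrame(matrix.T)  # 3000 rows x 100 columns -- same values

print(df_wide.shape, df_long.shape)

# Writing the long layout would then be, e.g.:
# df_long.to_hdf('myfile_long.h5', 'mydf', complevel=9, complib='bzip2')
```

The long layout plays better with HDF5's row-oriented storage and chunking, which is where the size and query-speed wins come from.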
I routinely have tables with 10M+ rows, and query times can be in the millisecond range. Even the example below is small. Having 10+ GB files is quite common (not to mention the astronomy folks, for whom 10 GB is a few seconds of data!).
-rw-rw-r-- 1 jreback users 203200986 May 19 20:58 test.csv
-rw-rw-r-- 1 jreback users  88007312 May 19 20:59 test.h5

In [1]: df = DataFrame(randn(1000000,10))

In [9]: df
Out[9]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0    1000000  non-null values
1    1000000  non-null values
2    1000000  non-null values
3    1000000  non-null values
4    1000000  non-null values
5    1000000  non-null values
6    1000000  non-null values
7    1000000  non-null values
8    1000000  non-null values
9    1000000  non-null values
dtypes: float64(10)

In [5]: %timeit df.to_csv('test.csv',mode='w')
1 loops, best of 3: 12.7 s per loop

In [6]: %timeit df.to_hdf('test.h5','df',mode='w')
1 loops, best of 3: 825 ms per loop

In [7]: %timeit pd.read_csv('test.csv',index_col=0)
1 loops, best of 3: 2.35 s per loop

In [8]: %timeit pd.read_hdf('test.h5','df')
10 loops, best of 3: 38 ms per loop
I really wouldn't worry about the size (I suspect you are not, but are merely interested, which is fine). The point of HDF5 is that disk is cheap and CPU is cheap, but you can't have everything in memory at once, so we optimize by using chunking.