I'm trying to overwrite a pandas DataFrame in an HDF5 file. Each time I do this, the file size grows, even though the stored frame content is the same. If I use mode='w' I lose all the other records. Is this a bug, or am I missing something?
import pandas

df = pandas.read_csv('1.csv')
for i in range(100):
    store = pandas.HDFStore('tmp.h5')
    store.put('TMP', df)
    store.close()
The tmp.h5 file grows in size on every iteration.
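Here is a minimal, self-contained variant of that reproduction (a synthetic DataFrame stands in for '1.csv'; the file name is kept from the question) that prints the file size on each pass:

import os

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 4))

for i in range(5):
    store = pd.HDFStore('tmp.h5')  # default mode='a', so the same file is reused
    store.put('TMP', df)           # replaces the 'TMP' node; the old node's space is not reclaimed
    store.close()
    print(i, os.path.getsize('tmp.h5'))  # the reported size keeps growing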
Averaged I/O times for each data format show an interesting pattern: HDF loads even more slowly than CSV, while the other binary formats perform noticeably better. The two most impressive are Feather and Parquet.
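A rough way to run such a comparison yourself is to time a write/read round trip per format (a sketch, assuming pyarrow is installed for the Feather and Parquet writers and PyTables for HDF; results depend heavily on data, dtypes, and hardware):

import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 10),
                  columns=[f'c{i}' for i in range(10)])

# (write function, read function, file name) per format
formats = {
    'csv': (df.to_csv, pd.read_csv, 'bench.csv'),
    'hdf': (lambda p: df.to_hdf(p, key='df', mode='w'),
            lambda p: pd.read_hdf(p, key='df'), 'bench.h5'),
    'feather': (df.to_feather, pd.read_feather, 'bench.feather'),
    'parquet': (df.to_parquet, pd.read_parquet, 'bench.parquet'),
}

for name, (write, read, path) in formats.items():
    t0 = time.perf_counter()
    write(path)
    t1 = time.perf_counter()
    read(path)
    t2 = time.perf_counter()
    print(f'{name}: write {t1 - t0:.2f}s, read {t2 - t1:.2f}s')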
Pandas uses PyTables for reading and writing HDF5 files. With the "fixed" format, PyTables serializes object-dtype data with pickle.
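The storage format can be selected per key; a small sketch (file and key names are illustrative):

import pandas as pd

df = pd.DataFrame({'num': [1.0, 2.0], 'obj': ['a', 'b']})

store = pd.HDFStore('formats.h5', mode='w')
# 'fixed' (the default for put): fast, but not appendable or queryable;
# the object-dtype column is pickled.
store.put('df_fixed', df, format='fixed')
# 'table': slower to write, but appendable and supports on-disk queries.
store.put('df_table', df, format='table')
store.close()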
The short answer is that a pandas DataFrame has no set limit on the number of cells; the practical size limit is the memory available on your machine, which is large enough that you will likely never have to worry about it.
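You can check how much memory a given DataFrame actually occupies:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 5))
# Total in-memory footprint in bytes; deep=True also counts
# the contents of object-dtype columns.
print(df.memory_usage(deep=True).sum())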
This is probably due to your chunk layout - the smaller the chunks, the more bloated your HDF5 file becomes. Try to find a balance between chunk sizes small enough to serve your access pattern and the size overhead that each chunk introduces in the HDF5 file.
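To illustrate the trade-off with PyTables directly (file and array names are made up for the example):

import numpy as np
import tables

data = np.random.rand(1_000_000)

f = tables.open_file('chunked.h5', mode='w')
# Many tiny chunks inflate per-chunk metadata and index overhead...
f.create_carray('/', 'small_chunks', obj=data, chunkshape=(256,))
# ...while larger chunks keep the file compact, at the cost of
# reading more data than needed for small slices.
f.create_carray('/', 'large_chunks', obj=data, chunkshape=(65536,))
f.close()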
Read the big warning at the bottom of that section of the pandas documentation: HDF5 does not automatically reclaim the space freed by deleted or replaced nodes.
This is how HDF5 works. When a node is overwritten or deleted, the space it occupied is marked free inside the file, but the file itself never shrinks; repeatedly replacing a key therefore keeps increasing the file size. To reclaim the space, repack the file into a fresh copy.
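Repacking can be done from Python with PyTables (a sketch; the output file name is up to you), or from the command line with PyTables' ptrepack utility, e.g. ptrepack tmp.h5 tmp_repacked.h5:

import tables

# Copy every live node into a fresh file; the copy contains none of the
# dead space left behind by overwritten keys.
tables.copy_file('tmp.h5', 'tmp_repacked.h5', overwrite=True)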