
HDF5 file grows in size after overwriting the pandas dataframe

I'm trying to overwrite a pandas DataFrame stored in an HDF5 file. Each time I do this, the file size grows, even though the stored frame's content is unchanged. If I use mode='w' I lose all the other records. Is this a bug, or am I missing something?

import pandas

df = pandas.read_csv('1.csv')
for i in range(100):
    # Re-open the store and overwrite the same key on every iteration.
    store = pandas.HDFStore('tmp.h5')
    store.put('TMP', df)
    store.close()

tmp.h5 grows in size on every iteration.

Sergey Sergienko asked Oct 13 '15 11:10

People also ask

Is HDF5 faster than CSV?

Benchmarks of averaged I/O times for each data format show that HDF actually loads more slowly than CSV, while the other binary formats perform noticeably better. The two most impressive are Feather and Parquet.

Can pandas read HDF5?

Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format.
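A minimal sketch of that round trip, assuming PyTables is installed (the file name demo.h5 is arbitrary):

```python
import pandas as pd

# Column 'b' has object dtype, so the "fixed" format serializes it
# with pickle (pandas emits a PerformanceWarning for this).
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df.to_hdf('demo.h5', key='demo', format='fixed')

# Read the frame back from the same key.
out = pd.read_hdf('demo.h5', 'demo')
```

The frame read back is equal to the one written, object column included.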

Is there a limit to the size of a Pandas Dataframe?

The short answer is yes, there is a size limit for pandas DataFrames, but in practice you will likely never hit it. The longer answer is that a DataFrame is bounded by the memory available to hold it, not by a set number of cells.

Why are h5 files so large?

This is probably due to your chunk layout - the smaller your chunks, the more your HDF5 file is bloated by per-chunk overhead. Try to find a balance between chunk sizes that fit your access pattern and the size overhead they introduce in the HDF5 file.
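A rough way to see that overhead with PyTables (file names are arbitrary, and exact sizes depend on the HDF5 build) is to write the same array once with tiny chunks and once with large ones:

```python
import numpy as np
import tables

data = np.zeros(1_000_000, dtype='f8')  # 8 MB of raw data

# Tiny chunks: tens of thousands of chunks, each carrying
# B-tree/index overhead in the file.
with tables.open_file('small_chunks.h5', 'w') as f:
    f.create_carray('/', 'x', obj=data, chunkshape=(16,))

# Large chunks: far fewer chunks, far less overhead.
with tables.open_file('big_chunks.h5', 'w') as f:
    f.create_carray('/', 'x', obj=data, chunkshape=(65536,))
```

The file written with 16-element chunks comes out noticeably larger than the one written with 65536-element chunks, even though both hold identical data.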


1 Answer

Read the big warning at the bottom of the pandas HDFStore documentation section on deleting from tables: HDF5 does not reclaim the space freed by deleted or overwritten nodes, so the file never shrinks on its own.

This is how HDF5 works. To recover the space, repack the file, for example with PyTables' ptrepack command-line utility, or write the live contents to a fresh file.

Jeff answered Sep 30 '22 10:09