Here's something spooky with pandas and HDF for Halloween:
import pandas
df = pandas.DataFrame([['a','b'] for i in range(1,1000)])
store = pandas.HDFStore('test.h5')
store['x'] = df
store.close()
then
ls -l test.h5
-rw-r--r-- 1 arthur arthur 1072080 Oct 26 10:50 test.h5
1.1M? A bit steep, but why not. Here's where things get really spooky:
store = pandas.HDFStore('test.h5') #open it again
store['x'] = df #do the same thing as before!
store.close()
then
ls -l test.h5
-rw-r--r-- 1 arthur arthur 2122768 Oct 26 10:52 test.h5
You've now entered the Twilight Zone. Needless to say, the contents of the store are indistinguishable after the operation, but each iteration makes the file a little fatter.
It seems to only happen when there are strings involved. Before I file a bug report, I'd like to know if I'm missing something here...
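This seems to be the reason: http://www.hdfgroup.org/hdf5-quest.html#del
HDF5 does not reclaim the space of a deleted or overwritten object, so the old copy of 'x' stays in the file as dead bytes.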
That's one big gotcha HDF5, wtf.
Yeah: "HDF5 is not a database". Folks often use ptrepack (part of PyTables) to "repack" the HDF5 file without any dead bytes.
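For what it's worth, here is a minimal sketch of how the repacking step could be scripted from Python. It assumes the ptrepack command-line tool that ships with PyTables is on your PATH; the helper names repack_command and repack_in_place and the ".repacked" temporary suffix are made up for illustration, not part of pandas or PyTables.

```python
import os
import shutil
import subprocess


def repack_command(src, dst):
    """Build the ptrepack call that copies src into a fresh file dst."""
    # Plain "ptrepack SRC DST"; options such as compression are left out.
    return ["ptrepack", src, dst]


def repack_in_place(path):
    """Repack an HDF5 file through a temporary copy, then swap it in.

    Hypothetical helper: assumes ptrepack (installed with PyTables) is on PATH.
    """
    if shutil.which("ptrepack") is None:
        raise RuntimeError("ptrepack not found; it is installed with PyTables")
    tmp = path + ".repacked"  # illustrative temporary name
    if os.path.exists(tmp):
        os.remove(tmp)  # start from a clean destination file
    subprocess.run(repack_command(path, tmp), check=True)
    os.replace(tmp, path)  # replace the bloated file with the repacked one
```

Close the HDFStore before calling repack_in_place('test.h5'); the copy only contains the live nodes, so the dead bytes should be gone afterwards.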