Mystery when storing a DataFrame containing strings in HDF with pandas

Here's something spooky with pandas and HDF for Halloween:

import pandas

df = pandas.DataFrame([['a', 'b'] for i in range(1, 1000)])
store = pandas.HDFStore('test.h5')
store['x'] = df
store.close()

then

ls -l test.h5
-rw-r--r-- 1 arthur arthur 1072080 Oct 26 10:50 test.h5

1.1M? A bit steep, but why not. Here's where things get really spooky:

store = pandas.HDFStore('test.h5')  # open it again
store['x'] = df  # do the same thing as before!
store.close()

then

ls -l test.h5
-rw-r--r-- 1 arthur arthur 2122768 Oct 26 10:52 test.h5

You've now entered the Twilight Zone. Needless to say, the contents of the store are the same after the operation, but each iteration makes the file a little fatter.

It seems to only happen when there are strings involved. Before I file a bug report, I'd like to know if I'm missing something here...

Arthur B. asked Oct 26 '12 14:10


2 Answers

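This seems to be the reason: http://www.hdfgroup.org/hdf5-quest.html#del

As far as I can tell, HDF5 does not give back the space of a deleted or overwritten object, so reassigning store['x'] after reopening the file leaves the old copy behind as dead space. That's one big gotcha with HDF5, wtf.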
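
Since the space is not reclaimed in place, one workaround is to rebuild the file from scratch rather than overwrite a key in an existing file. Here is a minimal sketch of that idea, using only the standard library. The helper name `rewrite_fresh` and the `writer` callback are hypothetical, not part of pandas or PyTables. With pandas, the writer could be something like `lambda p: df.to_hdf(p, key='x', mode='w')`.

```python
import os
import tempfile


def rewrite_fresh(path, writer):
    """Rebuild `path` from scratch via a temp file, then swap it in.

    `writer` is any callable that takes a file path and writes the full
    contents there (hypothetical helper, not a pandas or PyTables API).
    Returns the size in bytes of the rebuilt file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".h5", dir=directory)
    os.close(fd)
    try:
        writer(tmp_path)
        # Swap the new file in; the old file, dead space included, is dropped.
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return os.path.getsize(path)
```

Because the new file only ever contains what the writer put there, its size should stay flat no matter how many times this is repeated.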

Arthur B. answered Oct 23 '22 20:10


Yeah: "HDF5 is not a database". Folks often use ptrepack (part of PyTables) to "repack" the HDF5 file without any dead bytes.
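
For anyone who wants to script the repack step, here is a rough sketch that shells out to `ptrepack` from Python. The function names `ptrepack_command` and `repack` are made up for illustration. It assumes `ptrepack` is on the PATH (it is installed with PyTables) and uses only its `--complevel` and `--complib` options.

```python
import shutil
import subprocess


def ptrepack_command(src, dst, complevel=None, complib=None):
    """Build the argument list for PyTables' `ptrepack` command-line tool."""
    cmd = ["ptrepack"]
    if complevel is not None:
        cmd.append(f"--complevel={complevel}")
    if complib is not None:
        cmd.append(f"--complib={complib}")
    cmd += [src, dst]
    return cmd


def repack(src, dst, **options):
    """Copy the live contents of `src` into `dst`, leaving dead space behind."""
    if shutil.which("ptrepack") is None:
        raise RuntimeError("ptrepack not found on PATH; it ships with PyTables")
    # check=True raises CalledProcessError if ptrepack exits with an error.
    subprocess.run(ptrepack_command(src, dst, **options), check=True)
```

Typical use would be `repack('test.h5', 'test_packed.h5')`, with the destination being a path that does not exist yet, and then replacing the original file with the packed copy once it has been checked.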

Wes McKinney answered Oct 23 '22 19:10