
Why is an HDF5 file so huge when I store many empty pandas Series in it?

I create an HDF5 file with pandas using the following code:

import pandas as pd

store = pd.HDFStore("store.h5")

for x in range(1000):
    store["name"+str(x)] = pd.Series()

All the series are empty, so why does the "store.h5" file take up 1.1 GB on the hard drive?

matousc asked Jun 04 '15 19:06



1 Answer

Short version: You have found a bug. Quoting this bug on GitHub:

...required a bit of a hackjob (pytables doesn't like zero-length objects)

I can reproduce this error on my machine. Simply changing your code to this:

import pandas as pd
store = pd.HDFStore("store.h5")
for x in range(1000):
    store["name"+str(x)] = pd.Series([1,2])

results in a sane megabyte-scale file. I cannot find an open issue for this on GitHub; you might try reporting it.

I assume you've already worked around the issue in your code, but if you haven't, check that no array dimension is zero before storing an object:

import numpy as np
import pandas as pd
toStore = pd.Series()
assert np.prod(toStore.shape) != 0, 'Tried to store an empty object!'
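
If you would rather skip empty objects than abort on them, the same check can be wrapped in a small helper. This is only a sketch: `is_empty` and `put_if_nonempty` are names I made up for illustration, not part of pandas or PyTables, and the helper assumes only that the object has a `.shape` attribute and that the store supports `store[key] = obj`.

```python
from math import prod


def is_empty(obj):
    """Return True if any dimension of obj.shape is zero (hypothetical helper)."""
    return prod(obj.shape) == 0


def put_if_nonempty(store, key, obj):
    """Assign obj to store[key] unless it is empty (hypothetical helper).

    Returns True if the object was stored and False if it was skipped.
    """
    if is_empty(obj):
        return False
    store[key] = obj
    return True
```

In the loop from the question this would be called as put_if_nonempty(store, "name" + str(x), series), with store being the pd.HDFStore. Skipped keys will then be missing from the file, so any code that reads them back has to handle that case.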
Andreus answered Sep 18 '22 16:09
