I downloaded a dataset which is stored in .h5 files. I need to keep only certain columns and to be able to manipulate the data in it.
To do this, I tried to load it in a pandas dataframe. I've tried to use:
pd.read_hdf(path)
But I get: No dataset in HDF5 file.
I've found answers on SO (read HDF5 file to pandas DataFrame with conditions), but I don't need conditions, and the answer depends on how the file was written. I'm not the creator of the file, so I can't do anything about that.
I've also tried using h5py:
df = h5py.File(path)
But this is not easy to manipulate, and I can't seem to get the columns out of it (only the names of the columns, using df.keys()).
Any idea how to do this?
Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format.
Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.
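Because the file is self-describing, you can ask it what it contains before trying to load anything. Here is a minimal sketch using h5py; the helper name list_datasets is my own and not part of any library:

```python
import h5py


def list_datasets(path):
    """Return (name, shape, dtype) for every dataset in the file, at any depth."""
    found = []

    def visit(name, obj):
        # visititems calls this for every group and dataset below the root;
        # groups are skipped, only actual datasets are recorded
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape, obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return found
```

Printing the result of list_datasets(path) should tell you whether the names you saw with keys() are datasets or groups, and what shape each one has.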
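The easiest way to read them into pandas is to open the file with h5py, convert the dataset to an np.array, and then build a DataFrame from it. It would look something like this, where 'variable_1' is a placeholder for one of the dataset names in your file: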
df = pd.DataFrame(np.array(h5py.File(path)['variable_1']))
pandas' HDF support needs the HDF file to be formatted in a very specific way. See https://stackoverflow.com/a/33644128/4128030 for more info.
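If you only want certain columns, the same idea can be extended to several datasets at once. This is a rough sketch, assuming each top-level key in the file is a one-dimensional dataset and that all of them have the same length (I can't know that for your file); h5_to_dataframe and its columns parameter are names I made up:

```python
import h5py
import pandas as pd


def h5_to_dataframe(path, columns=None):
    """Build a DataFrame with one column per selected top-level dataset.

    Assumes every selected key is a 1-D dataset of the same length.
    """
    with h5py.File(path, "r") as f:
        # default to every top-level key when no selection is given
        names = list(f.keys()) if columns is None else list(columns)
        # ds[()] reads the whole dataset into memory as a NumPy array
        data = {name: f[name][()] for name in names}
    return pd.DataFrame(data)
```

For example, h5_to_dataframe(path, columns=["variable_1"]) would load just that one dataset. If some keys turn out to be groups or multi-dimensional arrays, this will fail or need adapting.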