I have a huge HDF5 file and want to load part of it into a pandas DataFrame to perform some operations, keeping only certain rows.
An example explains it better.
The original HDF5 file looks something like this:
A B C D
1 0 34 11
2 0 32 15
3 1 35 22
4 1 34 15
5 1 31 9
1 0 34 15
2 1 29 11
3 0 34 15
4 1 12 14
5 0 34 15
1 0 32 13
2 1 34 15
etc etc etc etc
What I am trying to do is load this, exactly as it is, into a pandas DataFrame, but only the rows where A is 1, 3 or 4.
So far I can only load the whole HDF5 file using:
store = pd.HDFStore('Resutls2015_10_21.h5')
df = pd.DataFrame(store['results_table'])
I do not see how to include a where condition here.
Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format.
The HDF5 file must be written in table format (as opposed to fixed format) in order to be queryable with pd.read_hdf's where argument.
Furthermore, A must be declared as a data_column:
df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')
or, to specify all columns as (queryable) data columns:
df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=True,
          format='table')
Then you could use
pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')
to select the rows where the value of column A is 1, 3 or 4. For example,
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})
# table format plus data_columns=['A'] makes column A queryable on disk
df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')
print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))
yields
    A  B   C   D
0   1  0  34  11
2   3  1  35  22
3   4  1  34  15
5   1  0  34  15
7   3  0  34  15
8   4  1  12  14
10  1  0  32  13
If you have a very long list of values, vals, then you could use string formatting to compose the right where argument:
where='A in {}'.format(vals)
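One caveat with the string-formatting approach: it relies on vals printing as a plain Python list literal. If vals is a tuple, a range, or holds NumPy integers, the formatted text may not be something the where parser accepts. A minimal sketch of a helper that normalizes the values first could look like this; build_where is a hypothetical name, not part of pandas, and it assumes the values are integers.

```python
def build_where(column, values):
    """Return a where string such as 'A in [1, 3, 4]'.

    Hypothetical helper: assumes `values` is an iterable of integers.
    """
    # int() turns NumPy integers into plain Python ints so they print as bare numbers
    plain = [int(v) for v in values]
    return "{} in {}".format(column, plain)


where = build_where("A", (1, 3, 4))
print(where)  # A in [1, 3, 4]
```

The resulting string can be passed as where=where to pd.read_hdf in the same way as a hand-written condition.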
You can do this using pandas.read_hdf, with its optional where parameter.
For example: read_hdf('store_tl.h5', 'table', where = ['index>2'])
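The question opens the file with pd.HDFStore rather than read_hdf. The store object has a select method that accepts the same kind of where string, so the filter can be applied there as well. What follows is a minimal sketch, assuming the data was written in table format with A as a data column; the temporary file, the sample data and the select_rows helper are made up for the demonstration.

```python
import os
import tempfile

import pandas as pd


def select_rows(path, key, where):
    """Open the store read-only and return only the rows matching `where`."""
    with pd.HDFStore(path, mode="r") as store:
        return store.select(key, where=where)


# Small stand-in for the real file (hypothetical path inside a temp directory)
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.h5")
sample = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 1],
    "B": [0, 0, 1, 1, 1, 0],
})
# The query only works because of format="table" and data_columns=["A"]
sample.to_hdf(path, key="results_table", mode="w", format="table",
              data_columns=["A"])

filtered = select_rows(path, "results_table", "A in [1,3,4]")
print(filtered)
```

Only the matching rows are returned, with their original index labels, so the full table never has to be loaded into a DataFrame first.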