
Read an HDF5 file into a pandas DataFrame with conditions

I have a huge HDF5 file and want to load part of it into a pandas DataFrame to perform some operations, keeping only the rows that match a condition.

I can explain better with an example:

Original HDF5 file would look something like:

A    B    C    D
1    0    34   11
2    0    32   15
3    1    35   22
4    1    34   15
5    1    31   9
1    0    34   15
2    1    29   11
3    0    34   15
4    1    12   14
5    0    34   15
1    0    32   13
2    1    34   15
etc  etc  etc  etc

What I am trying to do is load this, exactly as it is, into a pandas DataFrame, but only the rows where A is 1, 3 or 4.

So far I can only load the whole HDF5 file, using:

store = pd.HDFStore('Resutls2015_10_21.h5')
df = pd.DataFrame(store['results_table'])

I do not see how to include a where condition here.

asked Oct 31 '15 13:10 by codeKiller



2 Answers

The HDF5 file must be written in table format (as opposed to fixed format) in order to be queryable with the where argument of pd.read_hdf.

Furthermore, A must be declared as a data_column:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

or, to specify all columns as (queryable) data columns:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=True,
          format='table')

Then you could use

pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')

to select rows where the value of column A is 1, 3 or 4. For example,

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

# Write in table format and declare A as a data column so it can be queried.
df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

# Only the rows matching the where condition are read from disk.
print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))

yields

    A  B   C   D
0   1  0  34  11
2   3  1  35  22
3   4  1  34  15
5   1  0  34  15
7   3  0  34  15
8   4  1  12  14
10  1  0  32  13

If you have a very long list of values, vals, then you could use string formatting to compose the right where argument:

where='A in {}'.format(vals)
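One caveat when composing the string this way: if vals holds NumPy scalars (for example, the result of list(some_array)), NumPy 2 formats each element as np.int64(1) rather than 1, and the query parser is unlikely to accept that. A minimal sketch of a workaround, assuming an integer column; build_in_clause is a hypothetical helper, not part of pandas:

```python
def build_in_clause(column, values):
    """Compose a where string such as 'A in [1, 3, 4]'.

    Hypothetical helper for illustration; assumes the column holds integers.
    """
    # int() turns NumPy integer scalars into plain Python ints,
    # so each value is formatted as a bare number.
    plain = [int(v) for v in values]
    return '{} in {}'.format(column, plain)


vals = [1, 3, 4]
where = build_in_clause('A', vals)
print(where)  # A in [1, 3, 4]
```

The resulting string can then be passed as the where argument of pd.read_hdf.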
answered Nov 02 '22 04:11 by unutbu


You can do this using pandas.read_hdf with its optional where parameter.
For example: pd.read_hdf('store_tl.h5', 'table', where=['index>2'])
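For a file too large to load at once, the same kind of where filter can be applied through an HDFStore, as opened in the question, and the matching rows consumed chunk by chunk. A minimal sketch, assuming the data was written in table format with A as a data column; the temporary path, the sample data and the chunksize of 4 are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the real file (illustrative path and data).
path = os.path.join(tempfile.mkdtemp(), 'out.h5')
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
})
df.to_hdf(path, key='results_table', mode='w', format='table',
          data_columns=['A'])

# Open read-only; the where condition is evaluated against the file,
# and the matching rows arrive in pieces of at most `chunksize` rows.
parts = []
with pd.HDFStore(path, mode='r') as store:
    for chunk in store.select('results_table', where='A in [1,3,4]',
                              chunksize=4):
        parts.append(chunk)  # process each chunk here instead, if preferred

filtered = pd.concat(parts)
print(filtered)
```

Collecting the chunks with pd.concat is only for demonstration; with a truly huge result, each chunk would be processed and discarded inside the loop.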

answered Nov 02 '22 04:11 by Dean Fenster