read HDF5 file to pandas DataFrame with conditions

Tags:

python

pandas

hdf5

I have a huge HDF5 file and want to load part of it into a pandas DataFrame to perform some operations, keeping only certain rows.

I can explain better with an example:

The original HDF5 file looks something like this:

A    B    C    D
1    0    34   11
2    0    32   15
3    1    35   22
4    1    34   15
5    1    31   9
1    0    34   15
2    1    29   11
3    0    34   15
4    1    12   14
5    0    34   15
1    0    32   13
2    1    34   15
etc  etc  etc  etc

What I am trying to do is load this, exactly as it is, into a pandas DataFrame, but only the rows where A is 1, 3 or 4.

So far I can only load the whole HDF5 file using:

store = pd.HDFStore('Resutls2015_10_21.h5')
df = pd.DataFrame(store['results_table'])

I do not see how to include a where condition here.

573

asked Oct 31 '15 13:10

codeKiller

2 Answers

The hdf5 file must be written in table format (as opposed to fixed format) in order to be queryable with pd.read_hdf's where argument.

Furthermore, A must be declared as a data_column:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

or, to specify all columns as (queryable) data columns:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=True,
          format='table')

Then you could use

pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')

to select the rows where the value in column A is 1, 3 or 4. For example,

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))

yields

    A  B   C   D
0   1  0  34  11
2   3  1  35  22
3   4  1  34  15
5   1  0  34  15
7   3  0  34  15
8   4  1  12  14
10  1  0  32  13

If you have a very long list of values, vals, then you could use string formatting to compose the right where argument:

where='A in {}'.format(vals)
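
Since the question opens the file with pd.HDFStore, the same query can also be run on an open store with its select method. The following is only a minimal sketch under a few assumptions: the sample data, the temporary file path, and the variable names (vals, subset) are made up for illustration, and the file is written in table format with A as a data column, as described above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data mirroring the table in the question.
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

# Write to a throwaway file (assumed path) in table format,
# declaring A as a queryable data column.
path = os.path.join(tempfile.mkdtemp(), 'out.h5')
df.to_hdf(path, key='results_table', mode='w', data_columns=['A'],
          format='table')

# Compose the where string from a list of values.
vals = [1, 3, 4]
where = 'A in {}'.format(vals)

# Query through an open store; the context manager closes the file afterwards.
with pd.HDFStore(path, mode='r') as store:
    subset = store.select('results_table', where=where)

print(subset)
```

Only the matching rows are returned, so the full table never has to be loaded as a DataFrame first.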

181

answered Nov 02 '22 04:11

unutbu

You can do this using pandas.read_hdf with its optional where parameter.
For example: read_hdf('store_tl.h5', 'table', where=['index>2'])
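
To illustrate the index-based query above, here is a small sketch. It assumes a file written in table format; the data, the temporary path, and the key name are hypothetical. As far as I know, the index of a table-format DataFrame can be queried without listing anything in data_columns, whereas filtering on an ordinary column such as A still needs that declaration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data with the default integer index 0..4.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [0, 0, 1, 1, 1]})

# Table format, no data_columns: only the index is queried below.
path = os.path.join(tempfile.mkdtemp(), 'store_tl.h5')
df.to_hdf(path, key='results_table', mode='w', format='table')

# Keep only the rows whose index label is greater than 2.
result = pd.read_hdf(path, 'results_table', where='index > 2')

print(result.index.tolist())  # [3, 4]
```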

answered Nov 02 '22 04:11

Dean Fenster
