Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to query an HDF store using Pandas/Python

To manage the amount of RAM I consume in doing an analysis, I have a large dataset stored in hdf5 (.h5) and I need to query this dataset efficiently using Pandas.

The data set contains user performance data for a suite of apps. I only want to pull a few fields out of the 40 possible, and then filter the resulting dataframe to only those users who are using a one of a few apps that interest me.

# list of apps I want to analyze
apps = ['a','d','f']

# Users.h5 contains only one field_table called 'df'
store = pd.HDFStore('Users.h5')

# the following query works fine
df = store.select('df',columns=['account','metric1','metric2'],where=['Month==10','IsMessager==1'])

# the following pseudo-query fails
df = store.select('df',columns=['account','metric1','metric2'],where=['Month==10','IsMessager==1', 'app in apps'])

I realize that the string 'app in apps' is not what I want. This is simply a SQL-like representation of what I hope to achieve. I cant seem to pass a list of strings in any way that I try, but there must be a way.

For now I am simply running the query without this parameter and then I filter out the apps I don't want in a subsequent step thusly

df = df[df['app'].isin(apps)]

But this is much less efficient since ALL of the apps need to first be loaded into memory before I can remove them. In some cases, this is big problem because I don't have enough memory to support the whole unfiltered df.

like image 745
Tyler Greene Avatar asked Nov 28 '13 00:11

Tyler Greene


People also ask

What is HDF in pandas?

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

What is HDF in Python?

HDF5 file stands for Hierarchical Data Format 5. It is an open-source file which comes in handy to store large amount of data. As the name suggests, it stores data in a hierarchical structure within a single file.

How do I open an HDF5 file?

Open a HDF5/H5 file in HDFView To begin, open the HDFView application. Within the HDFView application, select File --> Open and navigate to the folder where you saved the NEONDSTowerTemperatureData. hdf5 file on your computer. Open this file in HDFView.

What is HDFStore?

HDF file stores meta data about the data stored in it so that any application can interpret the content and structure of the file. Pandas implements HDFStore interface to read, write, append, select a HDF file.


1 Answers

You are pretty close.

In [1]: df = DataFrame({'A' : ['foo','foo','bar','bar','baz'],
                        'B' : [1,2,1,2,1], 
                        'C' : np.random.randn(5) })

In [2]: df
Out[2]: 
     A  B         C
0  foo  1 -0.909708
1  foo  2  1.321838
2  bar  1  0.368994
3  bar  2 -0.058657
4  baz  1 -1.159151

[5 rows x 3 columns]

Write the store as a table (note that in 0.12 you will use table=True, rather than format='table'). Remember to specify the data_columns that you want to query when creating the table (or you can do data_columns=True)

In [3]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=['A','B'])

In [4]: pd.read_hdf('test.h5','df')
Out[4]: 
     A  B         C
0  foo  1 -0.909708
1  foo  2  1.321838
2  bar  1  0.368994
3  bar  2 -0.058657
4  baz  1 -1.159151

[5 rows x 3 columns]

Syntax in master/0.13, isin is accomplished via query_column=list_of_values. This is presented as a string to where.

In [8]: pd.read_hdf('test.h5','df',where='A=["foo","bar"] & B=1')
Out[8]: 
     A  B         C
0  foo  1 -0.909708
2  bar  1  0.368994

[2 rows x 3 columns]

Syntax in 0.12, this must be a list (which ands the conditions).

In [11]: pd.read_hdf('test.h5','df',where=[pd.Term('A','=',["foo","bar"]),'B=1'])
Out[11]: 
     A  B         C
0  foo  1 -0.909708
2  bar  1  0.368994

[2 rows x 3 columns]
like image 155
Jeff Avatar answered Oct 19 '22 21:10

Jeff