I have a large dataset stored in HDF5 (.h5), and to keep my RAM usage manageable during analysis I need to query it efficiently using Pandas.
The dataset contains user performance data for a suite of apps. I only want to pull a few fields out of the 40 available, and then filter the resulting dataframe to only those users who are using one of the few apps that interest me.
import pandas as pd

# list of apps I want to analyze
apps = ['a', 'd', 'f']
# Users.h5 contains only one table, keyed 'df'
store = pd.HDFStore('Users.h5')
# the following query works fine
df = store.select('df', columns=['account', 'metric1', 'metric2'], where=['Month==10', 'IsMessager==1'])
# the following pseudo-query fails
df = store.select('df', columns=['account', 'metric1', 'metric2'], where=['Month==10', 'IsMessager==1', 'app in apps'])
I realize that the string 'app in apps' is not valid; it is simply a SQL-like representation of what I hope to achieve. I can't seem to pass a list of strings in any way that I try, but there must be a way.
For now I am simply running the query without this condition and then filtering out the apps I don't want in a subsequent step:
df = df[df['app'].isin(apps)]
But this is much less efficient, since ALL of the apps must first be loaded into memory before I can drop the ones I don't want. In some cases this is a big problem, because I don't have enough memory to hold the whole unfiltered df.
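In the meantime, one way to keep memory bounded is to pull the table in chunks and apply the isin filter per chunk. This is only a minimal sketch, assuming the chunksize argument of HDFStore.select is available in your pandas version; the 'app' column is selected here as well so each chunk can be filtered on it:

import pandas as pd

apps = ['a', 'd', 'f']
store = pd.HDFStore('Users.h5')

# read the table in chunks so the full unfiltered frame never lives in memory;
# each chunk is cut down to the apps of interest before concatenation
chunks = []
for chunk in store.select('df',
                          columns=['account', 'metric1', 'metric2', 'app'],
                          where=['Month==10', 'IsMessager==1'],
                          chunksize=100000):
    chunks.append(chunk[chunk['app'].isin(apps)])

df = pd.concat(chunks, ignore_index=True)
store.close()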
You are pretty close.
In [1]: df = DataFrame({'A' : ['foo','foo','bar','bar','baz'],
                        'B' : [1,2,1,2,1],
                        'C' : np.random.randn(5) })
In [2]: df
Out[2]:
A B C
0 foo 1 -0.909708
1 foo 2 1.321838
2 bar 1 0.368994
3 bar 2 -0.058657
4 baz 1 -1.159151
[5 rows x 3 columns]
Write the store as a table (note that in 0.12 you will use table=True rather than format='table'). Remember to specify the data_columns that you want to query when creating the table (or you can pass data_columns=True).
In [3]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=['A','B'])
In [4]: pd.read_hdf('test.h5','df')
Out[4]:
A B C
0 foo 1 -0.909708
1 foo 2 1.321838
2 bar 1 0.368994
3 bar 2 -0.058657
4 baz 1 -1.159151
[5 rows x 3 columns]
Syntax in master/0.13: an isin is accomplished via query_column=list_of_values, presented as a string to where.
In [8]: pd.read_hdf('test.h5','df',where='A=["foo","bar"] & B=1')
Out[8]:
A B C
0 foo 1 -0.909708
2 bar 1 0.368994
[2 rows x 3 columns]
Syntax in 0.12: the where must be a list of terms (which are AND-ed together).
In [11]: pd.read_hdf('test.h5','df',where=[pd.Term('A','=',["foo","bar"]),'B=1'])
Out[11]:
A B C
0 foo 1 -0.909708
2 bar 1 0.368994
[2 rows x 3 columns]
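Translating this back to the original question: assuming Users.h5 was written with format='table' (or table=True on 0.12) and with app, Month and IsMessager declared as data_columns, a 0.13-style query could look roughly like this:

import pandas as pd

store = pd.HDFStore('Users.h5')
# the list of apps is written inline in the where string, following the
# A=["foo","bar"] syntax shown above; '&' ANDs the three conditions together
df = store.select('df',
                  columns=['account', 'metric1', 'metric2'],
                  where='Month==10 & IsMessager==1 & app=["a","d","f"]')
store.close()

On 0.12, the equivalent would pass a list to where, e.g. [pd.Term('app','=',apps), 'Month=10', 'IsMessager=1'].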