I have a large dataset stored in HDF5 (.h5), and to keep my RAM usage manageable during analysis I need to query it efficiently using Pandas.
The dataset contains user performance data for a suite of apps. I only want to pull a few fields out of the 40 available, and then filter the resulting dataframe to only those users who are using one of the few apps that interest me.
import pandas as pd

# list of apps I want to analyze
apps = ['a', 'd', 'f']
# Users.h5 contains only one table, keyed 'df'
store = pd.HDFStore('Users.h5')
# the following query works fine
df = store.select('df', columns=['account', 'metric1', 'metric2'], where=['Month==10', 'IsMessager==1'])
# the following pseudo-query fails
df = store.select('df', columns=['account', 'metric1', 'metric2'], where=['Month==10', 'IsMessager==1', 'app in apps'])
I realize that the string 'app in apps' is not valid; it is simply a SQL-like representation of what I hope to achieve. I can't seem to pass a list of strings in any way that I try, but there must be a way.
For now I am simply running the query without this condition and then filtering out the apps I don't want in a subsequent step:
df = df[df['app'].isin(apps)]
But this is much less efficient, since ALL of the apps must first be loaded into memory before I can drop the ones I don't want. In some cases this is a big problem, because I don't have enough memory to hold the whole unfiltered df.
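In the meantime, one way to keep memory bounded is to pull the table in chunks and apply the isin filter per chunk. This is only a minimal sketch, assuming the chunksize argument of HDFStore.select is available in your pandas version; the 'app' column is selected here as well so each chunk can be filtered on it:

import pandas as pd

apps = ['a', 'd', 'f']
store = pd.HDFStore('Users.h5')

# read the table in chunks so the full unfiltered frame never lives in memory;
# each chunk is cut down to the apps of interest before concatenation
chunks = []
for chunk in store.select('df',
                          columns=['account', 'metric1', 'metric2', 'app'],
                          where=['Month==10', 'IsMessager==1'],
                          chunksize=100000):
    chunks.append(chunk[chunk['app'].isin(apps)])

df = pd.concat(chunks, ignore_index=True)
store.close()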
You are pretty close.
In [1]: df = DataFrame({'A' : ['foo','foo','bar','bar','baz'],
                        'B' : [1,2,1,2,1],
                        'C' : np.random.randn(5) })
In [2]: df
Out[2]:
A B C
0 foo 1 -0.909708
1 foo 2 1.321838
2 bar 1 0.368994
3 bar 2 -0.058657
4 baz 1 -1.159151
[5 rows x 3 columns]
Write the store as a table (note that in 0.12 you will use table=True rather than format='table'). Remember to specify the data_columns that you want to query when creating the table (or you can pass data_columns=True).
In [3]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=['A','B'])
In [4]: pd.read_hdf('test.h5','df')
Out[4]:
A B C
0 foo 1 -0.909708
1 foo 2 1.321838
2 bar 1 0.368994
3 bar 2 -0.058657
4 baz 1 -1.159151
[5 rows x 3 columns]
Syntax in master/0.13: an isin is accomplished via query_column=list_of_values, presented as a string to where.
In [8]: pd.read_hdf('test.h5','df',where='A=["foo","bar"] & B=1')
Out[8]:
A B C
0 foo 1 -0.909708
2 bar 1 0.368994
[2 rows x 3 columns]
Syntax in 0.12: the where must be a list of terms (which are AND-ed together).
In [11]: pd.read_hdf('test.h5','df',where=[pd.Term('A','=',["foo","bar"]),'B=1'])
Out[11]:
A B C
0 foo 1 -0.909708
2 bar 1 0.368994
[2 rows x 3 columns]
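Translating this back to the original question: assuming Users.h5 was written with format='table' (or table=True on 0.12) and with app, Month and IsMessager declared as data_columns, a 0.13-style query could look roughly like this:

import pandas as pd

store = pd.HDFStore('Users.h5')
# the list of apps is written inline in the where string, following the
# A=["foo","bar"] syntax shown above; '&' ANDs the three conditions together
df = store.select('df',
                  columns=['account', 'metric1', 'metric2'],
                  where='Month==10 & IsMessager==1 & app=["a","d","f"]')
store.close()

On 0.12, the equivalent would pass a list to where, e.g. [pd.Term('app','=',apps), 'Month=10', 'IsMessager=1'].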