When selecting rows whose column column_name equals a scalar some_value, we use ==:
df.loc[df['column_name'] == some_value]
or use .query():
df.query('column_name == @some_value')
In a concrete example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Col1': 'what are men to rocks and mountains'.split(),
                   'Col2': 'the curves of your lips rewrite history.'.split(),
                   'Col3': np.arange(7),
                   'Col4': np.arange(7) * 8})
print(df)
Col1 Col2 Col3 Col4
0 what the 0 0
1 are curves 1 8
2 men of 2 16
3 to your 3 24
4 rocks lips 4 32
5 and rewrite 5 40
6 mountains history. 6 48
A query could be
rocks_row = df.loc[df['Col1'] == "rocks"]
which outputs
print(rocks_row)
Col1 Col2 Col3 Col4
4 rocks lips 4 32
I would like to pass a list of values to query against a dataframe and get back the rows that match each value.
The queries to execute would be in a list, e.g.
list_match = ['men', 'curves', 'history']
which would output all rows which meet this condition, i.e.
matches = pd.concat([df1, df2, df3])
where
df1 = df.loc[df['Col1'] == "men"]
df2 = df.loc[df['Col1'] == "curves"]
df3 = df.loc[df['Col1'] == "history"]
My idea would be to create a function that takes in a dataframe, a column, and a list of values to match against:
output = []

def find_queries(dataframe, column, values, output):
    for scalar in values:
        query = dataframe.loc[dataframe[column] == scalar]
        output.append(query)  # append each query result to a list
    return pd.concat(output)  # return the concatenated dataframes
However, this appears to be exceptionally slow, and it doesn't actually take advantage of the pandas data structure. What is the "standard" way to run a list of queries against a pandas dataframe?
EDIT: How does this translate into "more complex" queries in pandas, e.g. a where clause against an HDF5 document?
df.to_hdf('test.h5','df',mode='w',format='table',data_columns=['A','B'])
pd.read_hdf('test.h5','df')
pd.read_hdf('test.h5','df',where='A=["foo","bar"] & B=1')
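For the example dataframe above, I imagine the equivalent would look something like the following (an untested sketch; Col1 and Col4 would have to be listed in data_columns to be usable in the where clause):
# sketch only: write the example df out with Col1 and Col4 as queryable data columns
df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=['Col1', 'Col4'])
# list membership and a numeric comparison combined in one where string
pd.read_hdf('test.h5', 'df', where='Col1=["rocks", "mountains"] & Col4>40')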
If I understood your question correctly, you can do it either with boolean indexing, as @uhjish has already shown in his answer, or with the .query() method:
In [30]: search_list = ['rocks','mountains']
In [31]: df
Out[31]:
Col1 Col2 Col3 Col4
0 what the 0 0
1 are curves 1 8
2 men of 2 16
3 to your 3 24
4 rocks lips 4 32
5 and rewrite 5 40
6 mountains history. 6 48
using the .query() method:
In [32]: df.query('Col1 in @search_list and Col4 > 40')
Out[32]:
Col1 Col2 Col3 Col4
6 mountains history. 6 48
In [33]: df.query('Col1 in @search_list')
Out[33]:
Col1 Col2 Col3 Col4
4 rocks lips 4 32
6 mountains history. 6 48
using boolean indexing:
In [34]: df.loc[df.Col1.isin(search_list) & (df.Col4 > 40)]
Out[34]:
Col1 Col2 Col3 Col4
6 mountains history. 6 48
In [35]: df.loc[df.Col1.isin(search_list)]
Out[35]:
Col1 Col2 Col3 Col4
4 rocks lips 4 32
6 mountains history. 6 48
UPDATE: using a function:
def find_queries(df, qry, debug=0, **parms):
    if debug:
        print('[DEBUG]: Query:\t' + qry.format(**parms))
    return df.query(qry.format(**parms))
In [31]: find_queries(df, 'Col1 in {Col1} and Col4 > {Col4}', Col1='@search_list', Col4=40)
...:
Out[31]:
Col1 Col2 Col3 Col4
6 mountains history. 6 48
In [32]: find_queries(df, 'Col1 in {Col1} and Col4 > {Col4}', Col1='@search_list', Col4=10)
Out[32]:
Col1 Col2 Col3 Col4
4 rocks lips 4 32
6 mountains history. 6 48
including debugging info (print query):
In [40]: find_queries(df, 'Col1 in {Col1} and Col4 > {Col4}', Col1='@search_list', Col4=10, debug=1)
[DEBUG]: Query: Col1 in @search_list and Col4 > 10
Out[40]:
Col1 Col2 Col3 Col4
4 rocks lips 4 32
6 mountains history. 6 48
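Because the {} placeholders are simply filled with the string form of whatever you pass in, you can also hand the helper a plain Python list instead of the '@search_list' string (a small sketch; with the data above it should select the same rows 4 and 6):
# assumed variant: pass the list literal itself rather than an '@' reference;
# str.format() renders the expression as "Col1 in ['rocks', 'mountains'] and Col4 > 10"
find_queries(df, 'Col1 in {Col1} and Col4 > {Col4}',
             Col1=['rocks', 'mountains'], Col4=10, debug=1)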
The best way to deal with this is by indexing into the rows using a Boolean series as you would in R.
Using your df as an example,
In [5]: df.Col1 == "what"
Out[5]:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
Name: Col1, dtype: bool
In [6]: df[df.Col1 == "what"]
Out[6]:
Col1 Col2 Col3 Col4
0 what the 0 0
Now we combine this with the pandas isin function.
In [8]: df[df.Col1.isin(["men","rocks","mountains"])]
Out[8]:
Col1 Col2 Col3 Col4
2 men of 2 16
4 rocks lips 4 32
6 mountains history. 6 48
To filter on multiple columns we can chain them together with & and | operators like so.
In [10]: df[df.Col1.isin(["men","rocks","mountains"]) | df.Col2.isin(["lips","your"])]
Out[10]:
Col1 Col2 Col3 Col4
2 men of 2 16
3 to your 3 24
4 rocks lips 4 32
6 mountains history. 6 48
In [11]: df[df.Col1.isin(["men","rocks","mountains"]) & df.Col2.isin(["lips","your"])]
Out[11]:
Col1 Col2 Col3 Col4
4 rocks lips 4 32
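Folding this back into the helper from the question, a single isin mask replaces the loop-and-concat entirely (a minimal sketch, assuming the same inputs as the original function: one column name and a list of values to match):
def find_queries(dataframe, column, values):
    # one vectorised membership test instead of one .loc lookup per value
    return dataframe[dataframe[column].isin(values)]

find_queries(df, 'Col1', ['men', 'rocks', 'mountains'])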