Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dynamically filtering a pandas dataframe

I am trying to filter a pandas data frame using thresholds for three columns

import pandas as pd
df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
                   "B" : [2, 5, 3, 2, 6],
                   "C" : [-5, 2, 1, 8, 2]})
df = df.loc[(df.A > 0) & (df.B > 2) & (df.C > -1)].reset_index(drop = True)

df
    A  B  C
0   2  5  2
1  10  3  1
2   3  6  2

However, I want to do this inside a function where the names of the columns and their thresholds are given to me in a dictionary. Here's my first try that works ok. Essentially I am putting the filter inside cond variable and just run it:

df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
                   "B" : [2, 5, 3, 2, 6],
                   "C" : [-5, 2, 1, 8, 2]})
limits_dic = {"A" : 0, "B" : 2, "C" : -1}
cond = "df = df.loc["
for key in limits_dic.keys():
    cond += "(df." + key + " > " + str(limits_dic[key])+ ") & "
cond = cond[:-2] + "].reset_index(drop = True)"
exec(cond)
df
    A  B  C
0   2  5  2
1  10  3  1
2   3  6  2

Now, finally I put everything inside a function and it stops working (perhaps exec function does not like to be used inside a function!):

df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
                   "B" : [2, 5, 3, 2, 6],
                   "C" : [-5, 2, 1, 8, 2]})
limits_dic = {"A" : 0, "B" : 2, "C" : -1}
def filtering(df, limits_dic):
    cond = "df = df.loc["
    for key in limits_dic.keys():
        cond += "(df." + key + " > " + str(limits_dic[key])+ ") & "
    cond = cond[:-2] + "].reset_index(drop = True)"
    exec(cond)
    return(df)

df = filtering(df, limits_dic)
df
    A  B  C
0   6  2 -5
1   2  5  2
2  10  3  1
3  -5  2  8
4   3  6  2

I know that exec function acts differently when used inside a function but was not sure how to address the problem. Also, I am wondering there must be a more elegant way to define a function to do the filtering given two input: 1)df and 2)limits_dic = {"A" : 0, "B" : 2, "C" : -1}. I would appreciate any thoughts on this.

like image 936
ahoosh Avatar asked Aug 28 '17 18:08

ahoosh


3 Answers

If you're trying to build a dynamic query, there are easier ways. Here's one using a list comprehension and str.join:

query = ' & '.join(['{}>{}'.format(k, v) for k, v in limits_dic.items()])

Or, using f-strings with python-3.6+,

query = ' & '.join([f'{k}>{v}' for k, v in limits_dic.items()])

print(query)

'A>0 & C>-1 & B>2'

Pass the query string to df.query, it's meant for this very purpose:

out = df.query(query)
print(out)

    A  B  C
1   2  5  2
2  10  3  1
4   3  6  2

What if my column names have whitespace, or other weird characters?

From pandas 0.25, you can wrap your column name in backticks so this works:

query = ' & '.join([f'`{k}`>{v}' for k, v in limits_dic.items()])

See this Stack Overflow post for more.


You could also use df.eval if you want to obtain a boolean mask for your query, and then indexing becomes straightforward after that:

mask = df.eval(query)
print(mask)

0    False
1     True
2     True
3    False
4     True
dtype: bool

out = df[mask]
print(out)

    A  B  C
1   2  5  2
2  10  3  1
4   3  6  2

String Data

If you need to query columns that use string data, the code above will need a slight modification.

Consider (data from this answer):

df = pd.DataFrame({'gender':list('MMMFFF'),
                   'height':[4,5,4,5,5,4],
                   'age':[70,80,90,40,2,3]})

print (df)
  gender  height  age
0      M       4   70
1      M       5   80
2      M       4   90
3      F       5   40
4      F       5    2
5      F       4    3

And a list of columns, operators, and values:

column = ['height', 'age', 'gender']
equal = ['>', '>', '==']
condition = [1.68, 20, 'F']

The appropriate modification here is:

query = ' & '.join(f'{i} {j} {repr(k)}' for i, j, k in zip(column, equal, condition))
df.query(query)

   age gender  height
3   40      F       5

For information on the pd.eval() family of functions, their features and use cases, please visit Dynamic Expression Evaluation in pandas using pd.eval().

like image 74
cs95 Avatar answered Oct 19 '22 22:10

cs95


An alternative to @coldspeed 's version:

conditions = None
for key, val in limit_dic.items():
    cond = df[key] > val
    if conditions is None:
        conditions = cond
    else:
        conditions = conditions & cond
print(df[conditions])
like image 35
Victor Yan Avatar answered Oct 19 '22 21:10

Victor Yan


An alternative to both posted, that may or may not be more pythonic:

import pandas as pd
import operator
from functools import reduce

df = pd.DataFrame({"A": [6, 2, 10, -5, 3],
                   "B": [2, 5, 3, 2, 6],
                   "C": [-5, 2, 1, 8, 2]})

limits_dic = {"A": 0, "B": 2, "C": -1}

# equiv to [df['A'] > 0, df['B'] > 2 ...]
loc_elements = [df[key] > val for key, val in limits_dic.items()]

df = df.loc[reduce(operator.and_, loc_elements)]
like image 3
yvesva Avatar answered Oct 19 '22 20:10

yvesva