
pandas: complex filter on rows of DataFrame




I would like to filter rows by a function of each row, e.g.

    def f(row):
        return np.sin(row['velocity']) / np.prod(row['masses']) > 5

    df = pandas.DataFrame(...)
    filtered = df[apply_to_all_rows(df, f)]

Or for another more complex, contrived example,

    def g(row):
        if row['col1'].method1() == 1:
            val = row['col1'].method2() / row['col1'].method3(row['col3'], row['col4'])
        else:
            val = row['col2'].method5(row['col6'])
        return np.sin(val)

    df = pandas.DataFrame(...)
    filtered = df[apply_to_all_rows(df, g)]

How can I do so?

duckworthd asked Jul 10 '12 16:07



2 Answers

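You can do this with DataFrame.apply, which applies a function along a given axis. With axis=1 the function receives each row as a Series, and the resulting boolean Series can be used to index the DataFrame: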

    In [3]: df = pandas.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])

    In [4]: df
    Out[4]:
              a         b         c
    0 -0.001968 -1.877945 -1.515674
    1 -0.540628  0.793913 -0.983315
    2 -1.313574  1.946410  0.826350
    3  0.015763 -0.267860 -2.228350
    4  0.563111  1.195459  0.343168

    In [6]: df[df.apply(lambda x: x['b'] > x['c'], axis=1)]
    Out[6]:
              a         b         c
    1 -0.540628  0.793913 -0.983315
    2 -1.313574  1.946410  0.826350
    3  0.015763 -0.267860 -2.228350
    4  0.563111  1.195459  0.343168
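As a rough sketch of how the same approach might apply to the function f from the question: the example below assumes the masses live in two columns named mass1 and mass2 (hypothetical names, the question does not say how they are stored) and uses made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the column names and values are assumptions for illustration.
df = pd.DataFrame({
    "velocity": [0.5, 1.0, 1.5, 2.0],
    "mass1": [0.5, 2.0, 0.1, 1.0],
    "mass2": [1.0, 2.0, 0.5, 1.0],
})

def f(row):
    # Row-wise predicate: sin(velocity) divided by the product of the masses.
    return np.sin(row["velocity"]) / row[["mass1", "mass2"]].prod() > 5

# axis=1 passes each row to f as a Series; the result is a boolean Series
# aligned with df.index, so it can be used directly as a mask.
row_mask = df.apply(f, axis=1)
filtered = df[row_mask]
print(filtered)
```

Because f is called once per row in Python, this is flexible but can be slow on large frames; it is most useful when the logic cannot easily be written as column-wise operations.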
duckworthd answered Sep 20 '22 08:09


Suppose I had a DataFrame as follows:

    In [39]: df
    Out[39]:
          mass1     mass2  velocity
    0  1.461711 -0.404452  0.722502
    1 -2.169377  1.131037  0.232047
    2  0.009450 -0.868753  0.598470
    3  0.602463  0.299249  0.474564
    4 -0.675339 -0.816702  0.799289

I can use np.sin and DataFrame.prod to create a boolean mask:

    In [40]: mask = (np.sin(df.velocity) / df.iloc[:, 0:2].prod(axis=1)) > 0

    In [41]: mask
    Out[41]:
    0    False
    1    False
    2    False
    3     True
    4     True

Then use the mask to select from the DataFrame:

    In [42]: df[mask]
    Out[42]:
          mass1     mass2  velocity
    3  0.602463  0.299249  0.474564
    4 -0.675339 -0.816702  0.799289
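The mask approach can also cover branching logic like g in the question. The following is only a sketch under assumptions: the data values are made up, and the two branch formulas are hypothetical stand-ins for the question's method1 through method5, which are never defined. It uses np.where to pick a per-row value without a Python-level loop.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same column layout as the answer above.
df = pd.DataFrame({
    "mass1": [1.0, -2.0, 0.5, -0.5],
    "mass2": [-0.5, 1.0, 0.5, -1.0],
    "velocity": [0.7, 0.2, 0.6, 0.5],
})

# Vectorized mask: sin(velocity) / (mass1 * mass2) > 0, computed column-wise.
ratio = np.sin(df["velocity"]) / df[["mass1", "mass2"]].prod(axis=1)
vector_mask = ratio > 0

# Branching per row, expressed with np.where (the condition and both
# formulas are assumptions chosen only to illustrate the pattern).
val = np.where(
    df["mass1"] > 0,
    df["velocity"] / df["mass1"],   # used where the condition is True
    df["velocity"] * df["mass2"],   # used where the condition is False
)
branch_mask = pd.Series(np.sin(val) > 0, index=df.index)

print(df[vector_mask])
print(df[branch_mask])
```

Note that np.where evaluates both branches for every row before selecting, so an expression that is invalid on the unselected rows (for example, division by zero) still gets computed there.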
Chang She answered Sep 20 '22 08:09
