
Pandas : filter the rows based on a column containing lists

How can I flag the rows of a data frame based on the values of other columns?

I have the following data frame:

ip_df:
     class    name     marks          min_marks  min_subjects
0    I        tom      [89,85,80,74]  80         2
1    II       sam      [65,72,43,40]  85         1

Each row should be flagged based on its "min_subjects" and "min_marks" values.

  • For index 0, "min_subjects" is 2, so at least 2 elements of "marks" must be greater than "min_marks" (80). The condition holds, so a new column named "flag" should be set to 1.

  • For index 1, "min_subjects" is 1, so at least 1 element of "marks" must be greater than "min_marks" (85). The condition does not hold, so "flag" should be set to 0.

The expected output is:

op_df:
     class    name     marks          min_marks  min_subjects flag
0    I        tom      [89,85,80,74]  80         2            1
1    II       sam      [65,72,43,40]  85         1            0

Can anyone help me achieve this in the data frame?

Mahamutha M asked Nov 12 '19 11:11



2 Answers

Use a list comprehension with zip over the three columns. For each row, count the marks greater than min_marks with a generator expression and sum, compare that count with min_subjects, and convert the result to an integer:

df['flag'] = [1 if sum(x > c for x in a) >= b else 0 
                 for a, b, c in zip(df['marks'], df['min_subjects'], df['min_marks'])]

Alternatively, convert the boolean to 0 or 1 with int:

df['flag'] = [int(sum(x > c for x in a) >= b)
                 for a, b, c in zip(df['marks'], df['min_subjects'], df['min_marks'])]

Or a solution with NumPy:

import numpy as np

df['flag'] = [int(np.sum(np.array(a) > c) >= b)
                  for a, b, c in zip(df['marks'], df['min_subjects'], df['min_marks'])]

print (df)
  class name             marks  min_marks  min_subjects  flag
0     I  tom  [89, 85, 80, 74]         80             2     1
1    II  sam  [65, 72, 43, 40]         85             1     0
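
For anyone who wants to run the list-comprehension approach end to end, here is a minimal self-contained sketch. It assumes the sample data from the question is stored in a frame named df, and the helper count_flag is a hypothetical name introduced only for readability:

```python
import pandas as pd

# Sample data reconstructed from the question (the variable name `df` is an assumption).
df = pd.DataFrame({
    "class": ["I", "II"],
    "name": ["tom", "sam"],
    "marks": [[89, 85, 80, 74], [65, 72, 43, 40]],
    "min_marks": [80, 85],
    "min_subjects": [2, 1],
})


def count_flag(marks, threshold, required):
    """Return 1 if at least `required` marks are strictly greater than `threshold`, else 0."""
    return int(sum(m > threshold for m in marks) >= required)


# Walk the three columns in parallel, one row at a time.
df["flag"] = [
    count_flag(a, c, b)
    for a, b, c in zip(df["marks"], df["min_subjects"], df["min_marks"])
]

print(df["flag"].tolist())  # [1, 0]
```

The comparison is strict, so a mark equal to min_marks (80 for tom) does not count toward the total.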
jezrael answered Oct 16 '22 14:10


To avoid the Python-level for loop and rely on vectorized operations instead, you can use the explode method (available since pandas 0.25.0):

df1 = df.explode('marks')
print(df1)

Output:

  class name marks  min_marks  min_subjects
0     I  tom    89         80             2
0     I  tom    85         80             2
0     I  tom    80         80             2
0     I  tom    74         80             2
1    II  sam    65         85             1
1    II  sam    72         85             1
1    II  sam    43         85             1
1    II  sam    40         85             1

Compare marks with min_marks, count the passing marks per original row by grouping on the index, and compare that count with min_subjects:

df['flag'] = df1['marks'].gt(df1['min_marks'])\
.groupby(df1.index).sum().ge(df['min_subjects']).astype(int)

print(df)

Output:

  class name             marks  min_marks  min_subjects  flag
0     I  tom  [89, 85, 80, 74]         80             2     1
1    II  sam  [65, 72, 43, 40]         85             1     0
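
The explode steps can also be combined into one self-contained sketch. It assumes the same sample frame as in the question. The intermediate names exploded and passed are illustrative, and the cast to int is a defensive choice rather than something the original code requires:

```python
import pandas as pd

# Sample data reconstructed from the question (the variable name `df` is an assumption).
df = pd.DataFrame({
    "class": ["I", "II"],
    "name": ["tom", "sam"],
    "marks": [[89, 85, 80, 74], [65, 72, 43, 40]],
    "min_marks": [80, 85],
    "min_subjects": [2, 1],
})

# One row per mark; each exploded row keeps the index of the row it came from.
exploded = df.explode("marks")

# explode leaves `marks` as object dtype, so cast it before the comparison.
passed = exploded["marks"].astype(int).gt(exploded["min_marks"])

# Count the passing marks per original row, then compare with the required count.
df["flag"] = passed.groupby(level=0).sum().ge(df["min_subjects"]).astype(int)

print(df["flag"].tolist())  # [1, 0]
```

Note that explode turns an empty list into NaN, so the int cast in this sketch would raise for a row with no marks; such rows would need to be handled separately.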
Mykola Zotko answered Oct 16 '22 14:10