
Drop rows that only relate to one value in other columns pandas

Imagine I have a DataFrame like this:

item     name     gender
banana   tom      male
banana   kate     female
apple    kate     female
kiwi     jim      male
apple    tom      male
banana   kimmy    female
kiwi     kate     female
banana   tom      male

Is there a way to drop the rows of every person who is related to (bought) fewer than 2 items? I also don't want to drop duplicate rows. The output I want looks like this:

item     name     gender
banana   tom      male
banana   kate     female
apple    kate     female
apple    tom      male
kiwi     kate     female
banana   tom      male 
rockpock555 asked Oct 14 '22 21:10


1 Answer

@sammywemmy's solution: df.loc[df.groupby('name').item.transform('size').ge(2)]

  1. groupby: group rows with the same name together
# Get Each Group
print(df.groupby('name').apply(lambda s: s.reset_index()))
         index    item   name  gender
name                                 
jim   0      3    kiwi    jim    male
kate  0      1  banana   kate  female
      1      2   apple   kate  female
      2      6    kiwi   kate  female
kimmy 0      5  banana  kimmy  female
tom   0      0  banana    tom    male
      1      4   apple    tom    male
      2      7  banana    tom    male
  2. transform: get a value in every row that represents the group size (number of rows)
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['item'].transform('size')
print(df)
     item   name  gender  group_size
0  banana    tom    male           3
1  banana   kate  female           3
2   apple   kate  female           3
3    kiwi    jim    male           1
4   apple    tom    male           3
5  banana  kimmy  female           1
6    kiwi   kate  female           3
7  banana    tom    male           3

Any column could have been selected here, because size counts the rows in each group regardless of their values:

# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['gender'].transform('size')
print(df)
     item   name  gender  group_size
0  banana    tom    male           3
1  banana   kate  female           3
2   apple   kate  female           3
3    kiwi    jim    male           1
4   apple    tom    male           3
5  banana  kimmy  female           1
6    kiwi   kate  female           3
7  banana    tom    male           3

Notice that each row now carries its group's size in the last column. tom has 3 rows, so every row where name == tom has 3 in group_size.
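As a rough illustration of why transform is used here rather than a plain aggregation, the sketch below compares groupby(...).size(), which returns one value per group, with transform('size'), which returns one value per original row. The variable names per_group, per_row and other are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

# One value per group, indexed by name (4 entries)
per_group = df.groupby('name').size()
print(per_group.to_dict())

# One value per original row, indexed like df (8 entries)
per_row = df.groupby('name')['item'].transform('size')
print(per_row.tolist())

# Selecting a different column changes the Series name, not the values
other = df.groupby('name')['gender'].transform('size')
print(per_row.tolist() == other.tolist())
```

Because per_row shares df's index, it can be compared against a threshold and used directly as a row mask, which is what the next step relies on.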

  3. ge: convert to a Boolean index with a relational operator (greater than or equal to)
# Add Condition To determine if the row should be kept or not
df['should_keep'] = df.groupby('name')['item'].transform('size').ge(2)
print(df)
     item   name  gender  group_size  should_keep
0  banana    tom    male           3         True
1  banana   kate  female           3         True
2   apple   kate  female           3         True
3    kiwi    jim    male           1        False
4   apple    tom    male           3         True
5  banana  kimmy  female           1        False
6    kiwi   kate  female           3         True
7  banana    tom    male           3         True
  4. loc: use the Boolean index to get the desired rows
print(df.groupby('name')['item'].transform('size').ge(2))
0     True
1     True
2     True
3    False
4     True
5    False
6     True
7     True
Name: item, dtype: bool

loc keeps every row whose mask value is True and excludes every row whose value is False (indexes 3 and 5 are False, so they are not included).
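A minimal sketch of the last two steps on their own may help: it builds the mask once, checks that .ge(2) matches the >= operator, and then selects with loc. The names sizes, mask, kept and dropped are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

sizes = df.groupby('name')['item'].transform('size')

# .ge(2) is the method form of the >= comparison
mask = sizes.ge(2)
print(mask.equals(sizes >= 2))

# Rows where the mask is True are kept, in their original order
kept = df.loc[mask]
print(kept.index.tolist())

# Inverting the mask with ~ shows the rows that were dropped
dropped = df.loc[~mask]
print(dropped['name'].tolist())
```

Storing the mask in a variable avoids recomputing the groupby when both the kept and the dropped rows are needed.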


All together:

import pandas as pd

df = pd.DataFrame({'item': {0: 'banana', 1: 'banana', 2: 'apple',
                            3: 'kiwi', 4: 'apple', 5: 'banana',
                            6: 'kiwi', 7: 'banana'},
                   'name': {0: 'tom', 1: 'kate', 2: 'kate',
                            3: 'jim', 4: 'tom', 5: 'kimmy',
                            6: 'kate', 7: 'tom'},
                   'gender': {0: 'male', 1: 'female',
                              2: 'female', 3: 'male',
                              4: 'male', 5: 'female',
                              6: 'female', 7: 'male'}})

print(df.loc[df.groupby('name')['name'].transform('size').ge(2)])
     item  name  gender
0  banana   tom    male
1  banana  kate  female
2   apple  kate  female
4   apple   tom    male
6    kiwi  kate  female
7  banana   tom    male
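One caveat worth sketching: 'size' counts rows, so a person who bought the same item twice would pass the ge(2) test. If the intent is "at least 2 distinct items", transform('nunique') may be closer to what is wanted. The example below assumes a hypothetical extra buyer, ann, who is not in the original data and is added only to show the difference.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

# Hypothetical buyer (not in the question): same item bought twice
extra = pd.DataFrame({'item': ['kiwi', 'kiwi'],
                      'name': ['ann', 'ann'],
                      'gender': ['female', 'female']})
df2 = pd.concat([df, extra], ignore_index=True)

grouped = df2.groupby('name')['item']

# Counts rows per person: ann has 2 rows, so she is kept
by_rows = df2.loc[grouped.transform('size').ge(2)]

# Counts distinct items per person: ann has 1, so she is dropped
by_distinct = df2.loc[grouped.transform('nunique').ge(2)]

print('ann' in by_rows['name'].tolist())
print('ann' in by_distinct['name'].tolist())
print(by_distinct.index.tolist())
```

On the data from the question both variants return the same rows, since tom and kate each bought at least two different items.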
Henry Ecker answered Oct 18 '22 13:10