Imagine I have a DataFrame like this:
item name gender
banana tom male
banana kate female
apple kate female
kiwi jim male
apple tom male
banana kimmy female
kiwi kate female
banana tom male
Is there any way to drop the rows for any person who is related to (bought) fewer than 2 items? I don't want to drop duplicates either, so the output I want looks like this:
item name gender
banana tom male
banana kate female
apple kate female
apple tom male
kiwi kate female
banana tom male
@sammywemmy's solution:
df.loc[df.groupby('name').item.transform('size').ge(2)]
# Get Each Group
print(df.groupby('name').apply(lambda s: s.reset_index()))
         index    item   name  gender
name
jim   0      3    kiwi    jim    male
kate  0      1  banana   kate  female
      1      2   apple   kate  female
      2      6    kiwi   kate  female
kimmy 0      5  banana  kimmy  female
tom   0      0  banana    tom    male
      1      4   apple    tom    male
      2      7  banana    tom    male
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['item'].transform('size')
print(df)
item name gender group_size
0 banana tom male 3
1 banana kate female 3
2 apple kate female 3
3 kiwi jim male 1
4 apple tom male 3
5 banana kimmy female 1
6 kiwi kate female 3
7 banana tom male 3
Because size counts rows rather than values, any column could have been used here:
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['gender'].transform('size')
print(df)
item name gender group_size
0 banana tom male 3
1 banana kate female 3
2 apple kate female 3
3 kiwi jim male 1
4 apple tom male 3
5 banana kimmy female 1
6 kiwi kate female 3
7 banana tom male 3
Notice how each row now has the corresponding group size at the end. tom has 3 instances, so every row where name == tom has 3 in group_size.
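To make the role of transform more concrete, here is a minimal sketch comparing it with a plain aggregation on the same data. The variable names per_group and per_row are just for illustration. A plain size() gives one value per group, while transform('size') repeats that value on every row of the group and keeps the original index, which is what lets it line up with df.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

# One value per group, indexed by the group key (name)
per_group = df.groupby('name').size()

# One value per original row, indexed like df
per_row = df.groupby('name')['item'].transform('size')

print(per_group)
print(per_row)
```

Since per_row shares its index with df, it can be assigned as a new column or compared against a threshold directly.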
# Add Condition To determine if the row should be kept or not
df['should_keep'] = df.groupby('name')['item'].transform('size').ge(2)
print(df)
item name gender group_size should_keep
0 banana tom male 3 True
1 banana kate female 3 True
2 apple kate female 3 True
3 kiwi jim male 1 False
4 apple tom male 3 True
5 banana kimmy female 1 False
6 kiwi kate female 3 True
7 banana tom male 3 True
print(df.groupby('name')['item'].transform('size').ge(2))
0 True
1 True
2 True
3 False
4 True
5 False
6 True
7 True
Name: item, dtype: bool
loc will include any index that is True; any index that is False will be excluded. (Indexes 3 and 5 are False, so they will not be included.)
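As a small sketch of that selection step, the boolean Series can be stored in a variable and passed to loc. The names mask, kept and dropped are only illustrative. Negating the mask with ~ shows the rows that get removed.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

# True where the row's name appears in at least 2 rows
mask = df.groupby('name')['item'].transform('size').ge(2)

kept = df.loc[mask]       # rows where mask is True
dropped = df.loc[~mask]   # rows where mask is False

print(kept)
print(dropped)
```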
All together:
import pandas as pd
df = pd.DataFrame({'item': {0: 'banana', 1: 'banana', 2: 'apple',
3: 'kiwi', 4: 'apple', 5: 'banana',
6: 'kiwi', 7: 'banana'},
'name': {0: 'tom', 1: 'kate', 2: 'kate',
3: 'jim', 4: 'tom', 5: 'kimmy',
6: 'kate', 7: 'tom'},
'gender': {0: 'male', 1: 'female',
2: 'female', 3: 'male',
4: 'male', 5: 'female',
6: 'female', 7: 'male'}})
print(df.loc[df.groupby('name')['name'].transform('size').ge(2)])
item name gender
0 banana tom male
1 banana kate female
2 apple kate female
4 apple tom male
6 kiwi kate female
7 banana tom male
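For comparison, here are two other ways to write the same filter. These are alternatives sketched here, not part of the original answer: groupby(...).filter keeps whole groups that pass a test, and value_counts combined with map builds the per-row count without a groupby. On this data both should give the same rows as the transform approach.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

# Approach from the answer, used as the reference result
expected = df.loc[df.groupby('name')['name'].transform('size').ge(2)]

# Alternative 1: keep only groups with at least 2 rows
via_filter = df.groupby('name').filter(lambda g: len(g) >= 2)

# Alternative 2: map each name to its total count, then compare
counts = df['name'].value_counts()
via_counts = df[df['name'].map(counts).ge(2)]

print(via_filter)
print(via_counts)
```

The filter version calls a Python function once per group, so on data with many groups the transform or value_counts versions are likely to be faster.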