I have a pandas DataFrame with two columns: toy and color. The color column includes missing values.
How do I fill the missing color values with the most frequent color for that particular toy?
Here's the code to create a sample dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'toy': ['car'] * 4 + ['train'] * 5 + ['ball'] * 3 + ['truck'],
    'color': ['red', 'blue', 'blue', np.nan, 'green', np.nan,
              'red', 'red', np.nan, 'blue', 'red', np.nan, 'green']
})
Here's the sample dataset:
      toy  color
0     car    red
1     car   blue
2     car   blue
3     car    NaN
4   train  green
5   train    NaN
6   train    red
7   train    red
8   train    NaN
9    ball   blue
10   ball    red
11   ball    NaN
12  truck  green
Here's the desired result:
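- Replace the first NaN with blue, since that is the most frequent color for a car.
- Replace the second and third NaN with red, since that is the most frequent color for a train.
- Replace the fourth NaN with either blue or red, since they are tied as the most frequent color for a ball.
Notes about the real dataset:
- There are more toy types than shown here (not just four).
- There are no toy types that only have missing values for color, so the answer does not need to handle that case.
This question is related, but it doesn't answer my question of how to use the most frequent value to fill in missing values.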
You can use groupby()+transform()+fillna():
df['color'] = df['color'].fillna(df.groupby('toy')['color'].transform(lambda x: x.mode().iat[0]))
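To make the one-liner easier to follow, here is a minimal sketch that splits it into two steps on the sample data from the question. The intermediate names `most_frequent` and `filled` are only illustrative and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'toy': ['car'] * 4 + ['train'] * 5 + ['ball'] * 3 + ['truck'],
    'color': ['red', 'blue', 'blue', np.nan, 'green', np.nan,
              'red', 'red', np.nan, 'blue', 'red', np.nan, 'green'],
})

# Step 1: for every row, compute the most frequent non-missing color
# within that row's toy group. mode() skips NaN by default and returns
# its values sorted, so .iat[0] takes the first one when colors are tied.
most_frequent = df.groupby('toy')['color'].transform(lambda s: s.mode().iat[0])

# Step 2: fillna aligns on the index, so only the NaN rows are replaced
# by their group's most frequent color; existing values are untouched.
filled = df['color'].fillna(most_frequent)
print(filled.tolist())
```

With this approach a tie (as for ball) is always resolved the same way, because the first entry of the sorted mode is used. If a toy had only missing colors, `mode()` would be empty and `.iat[0]` would raise an error, which the question says does not occur in the real data.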
OR
If you want to pick a random value when two or more colors are tied for most frequent:
from random import choice
df['color'] = df['color'].fillna(df.groupby('toy')['color'].transform(lambda x: choice(x.mode().tolist())))
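A self-contained sketch of the random tie-breaking variant, again on the sample data from the question. The helper name `random_mode` and the seeded `random.Random(0)` generator are assumptions added here only to make the run repeatable; the original answer uses the module-level `choice` without a seed.

```python
import random

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'toy': ['car'] * 4 + ['train'] * 5 + ['ball'] * 3 + ['truck'],
    'color': ['red', 'blue', 'blue', np.nan, 'green', np.nan,
              'red', 'red', np.nan, 'blue', 'red', np.nan, 'green'],
})

# Seeded generator so repeated runs give the same result; the seed is arbitrary.
rng = random.Random(0)

def random_mode(s):
    # s.mode() holds every color tied for most frequent (NaN excluded);
    # pick one of them at random.
    return rng.choice(s.mode().tolist())

# The value returned for a group is broadcast to all of its rows, so every
# missing row of the same toy receives the same randomly chosen color.
fill_values = df.groupby('toy')['color'].transform(random_mode)
filled = df['color'].fillna(fill_values)
print(filled.tolist())
```

Only the ball group is affected by the randomness in this dataset, since car and train each have a single most frequent color. Note that the draw happens per toy, not per missing row; drawing a separate color for each missing row would need a different approach.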