I have a pandas DataFrame in which one column, signup, has multiple null values. The signup column holds categorical values covering several operating systems, such as iOS, Android, web, etc.
I would like to fill the NA values from the existing OS values, but the NA values should be filled according to the existing distribution of OS values.
Example: Let's say the dataset has the following count distribution of OS values:
signup
android web 14
ios web 16
mac 5
other 3
windows 6
Name: id, dtype: int64
I would like to fill the NA values based on the above distribution of the distinct OS values. The reason is that I want to maintain the current distribution, as filling with the mode value would likely skew the results.
Can someone help with how to achieve this?
There are two approaches to replacing NaN values with zeros in a pandas DataFrame: fillna() fills NA/NaN values using the specified method, and df.replace() is a simple method for replacing a string, regex, list, or dictionary.
The fillna() function fills NA/NaN values using the specified method. Its value argument is the value used to fill holes (e.g. 0), or alternatively a dict/Series/DataFrame specifying which value to use for each index (for a Series) or column (for a DataFrame).
fillna() can be applied to a specific column or to an entire DataFrame with any given value. You can modify the object in place with inplace, limit how many fills to perform with limit, or choose with axis whether to fill along rows or columns.
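As a minimal sketch of the fillna() behaviour described above, the following fills one column with a scalar and then fills several columns at once with a dict. The frame and the column name visits are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, made up for illustration only
df = pd.DataFrame({
    "signup": ["ios", np.nan, "web"],
    "visits": [3.0, np.nan, 5.0],
})

# A scalar fills every hole in the selected column
visits_filled = df["visits"].fillna(0)

# A dict maps each column name to the fill value for that column
df_filled = df.fillna({"signup": "unknown", "visits": 0})
print(df_filled)
```

Note that a constant fill like this changes the distribution of the column, which is exactly what the question wants to avoid.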
You could use something like NumPy's random.choice.
Starting with a frame fitting your description:
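import numpy as np
import pandas as pd
# Build a frame matching the printed output below
signup = ['mac'] * 3 + ['other'] * 2 + ['windows'] * 4 + [np.nan] * 5
df = pd.DataFrame({'id': range(1, 15), 'signup': signup})
print(df)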
id signup
0 1 mac
1 2 mac
2 3 mac
3 4 other
4 5 other
5 6 windows
6 7 windows
7 8 windows
8 9 windows
9 10 NaN
10 11 NaN
11 12 NaN
12 13 NaN
13 14 NaN
Updated using piRSquared's tip in the comments. First, figure out the current distribution:
s = df.signup.value_counts(normalize=True)
print(s)
windows 0.444444
mac 0.333333
other 0.222222
Name: signup, dtype: float64
Next we'll use boolean indexing to select the NaNs we want to update. This is also where we use random choice: we pass the index (windows, mac, other), the size needed, and the distribution of each signup value as the probabilities (p) parameter.
missing = df['signup'].isnull()
df.loc[missing, 'signup'] = np.random.choice(s.index, size=missing.sum(), p=s.values)
print(df)
id signup
0 1 mac
1 2 mac
2 3 mac
3 4 other
4 5 other
5 6 windows
6 7 windows
7 8 windows
8 9 windows
9 10 windows
10 11 windows
11 12 mac
12 13 windows
13 14 other
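Because the draw is random, the filled values will differ from run to run. The same idea can be wrapped in a small helper with a seed for repeatable results. This is only a sketch: the function name fill_by_distribution and the seed parameter are hypothetical additions, and it uses NumPy's Generator API (np.random.default_rng) rather than the np.random.choice call shown above.

```python
import numpy as np
import pandas as pd

def fill_by_distribution(series, seed=None):
    """Fill missing values by drawing from the observed category frequencies."""
    rng = np.random.default_rng(seed)
    # value_counts skips NaN by default, so these are the observed proportions
    probs = series.value_counts(normalize=True)
    missing = series.isna()
    filled = series.copy()
    # Draw one category per missing entry, weighted by the observed proportions
    filled.loc[missing] = rng.choice(
        probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy()
    )
    return filled

signup = pd.Series(["mac"] * 3 + ["other"] * 2 + ["windows"] * 4 + [np.nan] * 5)
filled = fill_by_distribution(signup, seed=0)
print(filled.value_counts())
```

With only a handful of missing rows the filled proportions will match the original distribution only approximately; the match tightens as the number of filled values grows. The helper assumes the column has at least one non-null value to draw from.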
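Another option is to sample from the non-null values with replace=True, which draws from the observed distribution directly: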
isnull = df.signup.isnull()
sample = df.signup.dropna().sample(isnull.sum(), replace=True).values
df.loc[isnull, 'signup'] = sample
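For reference, here is the sampling approach as a self-contained sketch that rebuilds the example frame from the earlier answer. The random_state argument is an addition made here so the draw is repeatable; it is not part of the original snippet.

```python
import numpy as np
import pandas as pd

# Example frame matching the one used in the earlier answer
df = pd.DataFrame({
    "id": range(1, 15),
    "signup": ["mac"] * 3 + ["other"] * 2 + ["windows"] * 4 + [np.nan] * 5,
})

isnull = df["signup"].isnull()
# Sampling the non-null values with replacement reproduces their frequencies
# in expectation; random_state is only here to make the result repeatable
sample = df["signup"].dropna().sample(
    int(isnull.sum()), replace=True, random_state=42
).values
df.loc[isnull, "signup"] = sample
print(df["signup"].value_counts())
```

Taking .values before assigning matters: it drops the index of the sampled rows, so the values are assigned by position instead of being aligned by index label.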