
Python fill missing values according to frequency

I have seen a lot of cases where missing values are filled with either the mean or the median. I was wondering how we can fill missing values according to frequency instead.

Here is my setup:

import numpy as np
import pandas as pd


df = pd.DataFrame({'sex': [1,1,1,1,0,0,np.nan,np.nan,np.nan]})
df['sex_fillna'] = df['sex'].fillna(df.sex.mode()[0])
print(df)
   sex  sex_fillna
0  1.0         1.0  We have 4 males
1  1.0         1.0
2  1.0         1.0
3  1.0         1.0
4  0.0         0.0  we have 2 females, so ratio is 2
5  0.0         0.0
6  NaN         1.0  Here, I want random choice of [1,1,0]  
7  NaN         1.0  eg. 1,1,0 or 1,0,1 or 0,1,1 randomly
8  NaN         1.0

Is there a generic way to do this?

My attempt

df['sex_fillna2'] = df['sex'].fillna(np.random.randint(0, 2))  # here the ratio is not guaranteed to be approximately 4/2 = 2
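
The attempt above falls short for a reason worth spelling out: `np.random.randint(0, 2)` is evaluated once and returns a single scalar, so `fillna` writes that same value into every missing row. A minimal sketch of the effect (the seed is arbitrary and only there for repeatability):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, only for repeatability

df = pd.DataFrame({'sex': [1, 1, 1, 1, 0, 0, np.nan, np.nan, np.nan]})

# randint(0, 2) runs once, before fillna, and yields one scalar (0 or 1)
fill_value = np.random.randint(0, 2)
filled = df['sex'].fillna(fill_value)

# every formerly missing row therefore holds the same value
distinct_fills = filled[df['sex'].isna()].nunique()
print(fill_value, distinct_fills)
```

Whatever the seed, `distinct_fills` is 1 here, so the filled rows can never follow a 2:1 mix.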

NOTE: This example uses only binary values, but I am looking for a solution that works for categorical variables with more than two categories.

For example:

class: A   B   C
       20% 40% 60%

Then, instead of filling all NaNs with class C, I would like to fill them according to the frequency counts.

But, is this a good idea?

As some comments point out, imputing missing values with different values for different rows might or might not be a good idea. I have created a question on Cross Validated about this. If you want to give some input or see whether it is a good idea, visit: https://stats.stackexchange.com/questions/484467/is-it-better-to-fillnans-based-on-frequency-rather-than-all-values-with-mean-or

BhishanPoudel asked Aug 23 '20 14:08


2 Answers

Use value_counts together with np.random.choice:

s = df.sex.value_counts(normalize=True)
df['sex_fillna'] = df['sex']
df.loc[df.sex.isna(), 'sex_fillna'] = np.random.choice(s.index, p=s.values, size=df.sex.isna().sum())
df
Out[119]: 
   sex  sex_fillna
0  1.0         1.0
1  1.0         1.0
2  1.0         1.0
3  1.0         1.0
4  0.0         0.0
5  0.0         0.0
6  NaN         0.0
7  NaN         1.0
8  NaN         1.0

In s, the index holds the categories and the values hold their probabilities:

s
Out[120]: 
1.0    0.666667
0.0    0.333333
Name: sex, dtype: float64
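
Note that `np.random.choice` draws each fill independently, so the 2:1 ratio among the filled rows holds only on average; a draw such as 1, 1, 1 is possible. If the fills should match the observed proportions as closely as the number of gaps allows (the "1,1,0 in random order" from the question), one option is to allocate the counts first and then shuffle. This is a sketch of a different technique from the answer above; the helper name `fill_proportional` and the largest-remainder rounding are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def fill_proportional(series, rng):
    """Fill NaNs so that fill counts track observed frequencies, in random order."""
    probs = series.value_counts(normalize=True)
    n_missing = int(series.isna().sum())
    # ideal (fractional) number of fills per category
    ideal = probs.to_numpy() * n_missing
    counts = np.floor(ideal).astype(int)
    # hand leftover slots to the categories with the largest remainders
    leftover = int(n_missing - counts.sum())
    order = np.argsort(-(ideal - counts), kind="stable")
    counts[order[:leftover]] += 1
    # repeat each category by its count, then randomize the order
    fills = np.repeat(probs.index.to_numpy(), counts)
    rng.shuffle(fills)
    out = series.copy()
    out[series.isna()] = fills
    return out

rng = np.random.default_rng(42)  # arbitrary seed
sex = pd.Series([1, 1, 1, 1, 0, 0, np.nan, np.nan, np.nan])
filled = fill_proportional(sex, rng)
# the three filled rows hold two 1.0 and one 0.0, in random order
print(filled[sex.isna()].tolist())
```

The trade-off is that the fills are no longer independent draws, which may or may not matter for the analysis that follows.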
BENY answered Oct 19 '22 23:10


A generic answer in case you have more than 2 valid values in your column is to find the distribution and fill based on that. For example,

dist = df.sex.value_counts(normalize=True)
print(dist)
1.0    0.666667
0.0    0.333333
Name: sex, dtype: float64

Then get the rows with missing values

nan_rows = df['sex'].isnull()

Finally, fill those rows with randomly selected values based on the above distribution

df.loc[nan_rows,'sex'] = np.random.choice(dist.index, size=len(df[nan_rows]),p=dist.values)
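
The same three steps carry over unchanged to a column with more than two categories, which is what the question ultimately asks about. A minimal sketch, assuming a hypothetical column `cls` with made-up labels A, B and C, and using a seeded `np.random.default_rng` generator (chosen here for reproducibility, not part of the steps above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# hypothetical data: 1 x A, 2 x B, 3 x C, plus 4 missing values
df = pd.DataFrame({'cls': ['A', 'B', 'B', 'C', 'C', 'C'] + [np.nan] * 4})

# observed distribution; value_counts ignores NaN by default
dist = df['cls'].value_counts(normalize=True)
nan_rows = df['cls'].isnull()

# draw one label per missing row, weighted by observed frequency
df.loc[nan_rows, 'cls'] = rng.choice(
    dist.index.to_numpy(), size=int(nan_rows.sum()), p=dist.to_numpy()
)
print(df['cls'].value_counts())
```

Because the draws are random, the exact counts after filling depend on the seed; only the sampling weights (1/6, 2/6, 3/6) are fixed.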
Tasos answered Oct 20 '22 00:10