
Imputation of missing values for categories in pandas

Tags: python, pandas

How can I fill NaNs in a categorical column of a pandas DataFrame with that column's most frequent level?

In R's randomForest package there is an na.roughfix option, documented as returning: "A completed data matrix or data frame. For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered."

In pandas, for numeric variables I can fill NaN values with:

df = df.fillna(df.median(numeric_only=True))  # numeric_only avoids a TypeError on non-numeric columns in recent pandas
asked Sep 16 '15 20:09 by Igor Barinov



2 Answers

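You can use df = df.fillna(df['Label'].value_counts().index[0]) to fill NaNs with the most frequent value of a single column (here Label). Because a single scalar is passed to fillna, NaNs in every column of df receive that value.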
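To restrict the fill to the Label column only, rather than applying its most frequent value to NaNs across the whole frame, something like the following should work. This is a minimal sketch on made-up data: the frame and the Other column are hypothetical and only there to show that the remaining columns are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "Label": ["a", "b", "a", np.nan],
    "Other": ["x", np.nan, "y", "y"],
})

# value_counts() ignores NaN by default and sorts by frequency (descending),
# so the first index entry is the most frequent value.
most_frequent = df["Label"].value_counts().index[0]

# Fill only the Label column; NaNs in other columns are left alone.
df["Label"] = df["Label"].fillna(most_frequent)
```

Note that value_counts().index[0] raises an IndexError if the column contains nothing but NaN, so an all-missing column needs separate handling.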

If you want to fill every column with its own most frequent value, you can use:

df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))

UPDATE 2018-10-25

Since version 0.13.1, pandas has included a mode method for Series and DataFrames. You can use it to fill missing values in each column with that column's own most frequent value, like this:

df = df.fillna(df.mode().iloc[0]) 
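The line above fills numeric columns with their mode as well. To get closer to what na.roughfix does (median for numeric columns, most frequent value for the rest), one possible sketch is shown below. The frame and its column names (age, color) are hypothetical, and ties are resolved by taking the first row of mode(), not at random as in R.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "color": ["red", "blue", np.nan, "red"],
})

# Median of each numeric column.
numeric_fill = df.median(numeric_only=True)

# Most frequent value of each remaining column. mode() returns one row per
# tied value, in sorted order, so iloc[0] picks the first of any ties.
other_cols = df.columns.difference(numeric_fill.index)
categorical_fill = df[other_cols].mode().iloc[0]

# A dict passed to fillna maps each column name to its own fill value.
fill_values = {**numeric_fill.to_dict(), **categorical_fill.to_dict()}
filled = df.fillna(fill_values)
```

Here the missing age becomes 30.0 (the median of 20, 30 and 40) and the missing color becomes "red".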
answered Sep 25 '22 04:09 by hellpanderr


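def fillna(col):
    # Replace NaNs in this column with its most frequent value.
    # value_counts() skips NaN and sorts by frequency, descending.
    return col.fillna(col.value_counts().index[0])

# Apply the helper to every column of the DataFrame.
df = df.apply(fillna)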
answered Sep 23 '22 04:09 by Pratik Gohil