
Impute categorical missing values in scikit-learn

I've got pandas data with some columns of text type. There are some NaN values in these text columns. What I'm trying to do is impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN with the most frequent value). The problem is in the implementation. Suppose there is a pandas DataFrame df with 30 columns, 10 of which are categorical. Once I run:

    from sklearn.preprocessing import Imputer
    imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
    imp.fit(df)

Python raises an error: "could not convert string to float: 'run1'", where 'run1' is an ordinary (non-missing) value from the first column with categorical data.

Any help would be very welcome.

night_bat asked Aug 11 '14 09:08




1 Answer

To use mean values for numeric columns and the most frequent value for non-numeric columns, you could do something like this. You could further distinguish between integers and floats; I guess it might make sense to use the median for integer columns instead.

    import pandas as pd
    import numpy as np

    from sklearn.base import TransformerMixin

    class DataFrameImputer(TransformerMixin):

        def __init__(self):
            """Impute missing values.

            Columns of dtype object are imputed with the most frequent value
            in column.

            Columns of other types are imputed with mean of column.

            """
        def fit(self, X, y=None):

            self.fill = pd.Series([X[c].value_counts().index[0]
                if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
                index=X.columns)

            return self

        def transform(self, X, y=None):
            return X.fillna(self.fill)

    data = [
        ['a', 1, 2],
        ['b', 1, 1],
        ['b', 2, 2],
        [np.nan, np.nan, np.nan]
    ]

    X = pd.DataFrame(data)
    xt = DataFrameImputer().fit_transform(X)

    print('before...')
    print(X)
    print('after...')
    print(xt)

which prints,

    before...
         0   1   2
    0    a   1   2
    1    b   1   1
    2    b   2   2
    3  NaN NaN NaN
    after...
       0         1         2
    0  a  1.000000  2.000000
    1  b  1.000000  1.000000
    2  b  2.000000  2.000000
    3  b  1.333333  1.666667
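On recent scikit-learn releases, where sklearn.preprocessing.Imputer is no longer available, a similar result can probably be had with sklearn.impute.SimpleImputer, which accepts string columns when strategy='most_frequent'. The following is only a minimal sketch, not a drop-in replacement for the class above: the frame and the column names run and score are made up for illustration, and the split into "numeric" and "everything else" via select_dtypes is an assumption about how the columns should be grouped.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical example frame; "run" and "score" are illustrative names only.
df = pd.DataFrame({
    "run": ["run1", "run2", "run2", np.nan],
    "score": [1.0, 1.0, 2.0, np.nan],
})

# Assumption: every non-numeric column is treated as categorical.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

imputed = df.copy()
# Most frequent value for the categorical columns.
imputed[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
# Column mean for the numeric columns.
imputed[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])

print(imputed)
```

Fitting the two imputers separately keeps the string columns away from the mean computation, which is what triggered the "could not convert string to float" error in the question.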
sveitser answered Sep 21 '22 17:09