Impute categorical missing values in scikit-learn

Tags: python, pandas, imputation, scikit-learn

I have pandas data with some text columns, and those columns contain NaN values. I'm trying to impute the NaNs with sklearn.preprocessing.Imputer, replacing each NaN with the most frequent value in its column. The problem is in the implementation. Suppose there is a pandas DataFrame df with 30 columns, 10 of which are categorical. When I run:

    from sklearn.preprocessing import Imputer
    imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
    imp.fit(df)

Python raises an error: "could not convert string to float: 'run1'", where 'run1' is an ordinary (non-missing) value from the first column with categorical data.

Any help would be very welcome.

424

asked Aug 11 '14 09:08

night_bat

1 Answer

To use the mean for numeric columns and the most frequent value for non-numeric columns, you could do something like this. You could further distinguish between integers and floats; it might make sense to use the median for integer columns instead.

    import pandas as pd
    import numpy as np

    from sklearn.base import TransformerMixin

    class DataFrameImputer(TransformerMixin):

        def __init__(self):
            """Impute missing values.

            Columns of dtype object are imputed with the most frequent value
            in column.

            Columns of other types are imputed with mean of column.

            """

        def fit(self, X, y=None):

            self.fill = pd.Series([X[c].value_counts().index[0]
                if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
                index=X.columns)

            return self

        def transform(self, X, y=None):
            return X.fillna(self.fill)

    data = [
        ['a', 1, 2],
        ['b', 1, 1],
        ['b', 2, 2],
        [np.nan, np.nan, np.nan]
    ]

    X = pd.DataFrame(data)
    xt = DataFrameImputer().fit_transform(X)

    print('before...')
    print(X)
    print('after...')
    print(xt)

which prints:

    before...
         0   1   2
    0    a   1   2
    1    b   1   1
    2    b   2   2
    3  NaN NaN NaN
    after...
       0         1         2
    0  a  1.000000  2.000000
    1  b  1.000000  1.000000
    2  b  2.000000  2.000000
    3  b  1.333333  1.666667
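Following up on the integer/float distinction mentioned above, here is a minimal sketch of how it might look. It is not part of the original answer. The helper name `fill_values_by_dtype` and the column names are made up for illustration. It also assumes integer columns are stored with pandas' nullable `Int64` dtype, because a plain integer column that contains NaN is converted to float64 and can no longer be told apart from a float column.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def fill_values_by_dtype(df):
    """Hypothetical helper: pick one fill value per column based on its dtype."""
    fill = {}
    for col in df.columns:
        s = df[col]
        if is_float_dtype(s):
            # Float columns: mean of the non-missing values.
            fill[col] = s.mean()
        elif is_integer_dtype(s):
            # Integer columns: median, rounded so the column can stay integer.
            fill[col] = int(round(s.median()))
        else:
            # Everything else (e.g. text): most frequent non-missing value.
            fill[col] = s.value_counts().index[0]
    return fill


# Illustrative data; "count" uses the nullable Int64 dtype (an assumption).
df = pd.DataFrame({
    "label": ["a", "b", "b", np.nan],
    "count": pd.array([1, 1, 2, None], dtype="Int64"),
    "score": [2.0, 1.0, 2.0, np.nan],
})

fill = fill_values_by_dtype(df)
filled = df.fillna(fill)  # a dict fills each column with its own value
print(filled)
```

As in the answer's code, a column whose values are all missing would fail in the `value_counts().index[0]` step, so such columns would need to be handled separately.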

137

answered Sep 21 '22 17:09

sveitser
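A note for readers on newer scikit-learn versions: as far as I know, `sklearn.preprocessing.Imputer` was deprecated in 0.20 and removed in 0.22, and its replacement `sklearn.impute.SimpleImputer` accepts `strategy="most_frequent"` on string columns. The sketch below is one possible way to use it, assuming you already know which columns are categorical. The column names and data are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data with one text column and one numeric column.
df = pd.DataFrame({
    "run": ["run1", "run2", "run2", np.nan],
    "value": [1.0, 1.0, 2.0, np.nan],
})

cat_cols = ["run"]    # assumed to be known in advance
num_cols = ["value"]

imputed = df.copy()

# Most frequent value for the text columns (cast to plain object dtype first).
cat_imputer = SimpleImputer(strategy="most_frequent")
imputed[cat_cols] = cat_imputer.fit_transform(df[cat_cols].astype(object))

# Mean for the numeric columns.
num_imputer = SimpleImputer(strategy="mean")
imputed[num_cols] = num_imputer.fit_transform(df[num_cols])

print(imputed)
```

Fitting the two imputers separately keeps the text columns away from the numeric strategy, which is what triggered the "could not convert string to float" error in the question.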
