Sklearn's SimpleImputer doesn't work in a pipeline?

I have a pandas dataframe that has some NaN values in a particular column:

1291   NaN
1841   NaN
2049   NaN
Name: some column, dtype: float64

And I have made the following pipeline in order to deal with it:

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

scaler = StandardScaler(with_mean=True)
imputer = SimpleImputer(strategy='median')
logistic = LogisticRegression()

pipe = Pipeline([('imputer', imputer),
                 ('scaler', scaler),
                 ('logistic', logistic)])

Now when I pass this pipeline to a RandomizedSearchCV, I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The full traceback is quite a bit longer than that; I can post it in an edit if necessary. I am quite sure that this column is the only one that contains NaNs. Moreover, if I switch from SimpleImputer to the (now deprecated) Imputer in the pipeline, the pipeline works just fine in my RandomizedSearchCV. I checked the documentation, and it seems that SimpleImputer is supposed to behave in (nearly) the same way as Imputer. What is the difference in behavior? How do I use an imputer in my pipeline without relying on the deprecated Imputer?

asked Aug 08 '18 08:08 by Marcel



1 Answer

SimpleImputer in make_pipeline

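from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# ColumnSelector is not part of scikit-learn; it is assumed here to be a
# custom transformer that returns the listed DataFrame columns.
preprocess_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        ('Handle numeric columns', make_pipeline(
            ColumnSelector(columns=['Amount']),
            SimpleImputer(strategy='constant', fill_value=0),
            StandardScaler()
        )),
        ('Handle categorical data', make_pipeline(
            ColumnSelector(columns=['Type', 'Name', 'Changes']),
            SimpleImputer(strategy='constant', missing_values=' ', fill_value='missing_value'),
            OneHotEncoder(sparse=False)  # use sparse_output=False on scikit-learn >= 1.2
        ))
    ])
)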
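
ColumnSelector in the snippet above is not a scikit-learn class, so the snippet needs a definition for it before it can run. A minimal sketch, assuming it simply returns the named DataFrame columns (the class body and the toy DataFrame are illustrative assumptions, not the answerer's original code):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: keep only the named DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        return X[self.columns]


# Toy data with one missing value in the numeric column.
df = pd.DataFrame({"Amount": [10.0, np.nan, 30.0], "Type": ["a", "b", "a"]})

numeric_branch = make_pipeline(
    ColumnSelector(columns=["Amount"]),
    SimpleImputer(strategy="constant", fill_value=0),  # NaN -> 0
    StandardScaler(),
)
amount_scaled = numeric_branch.fit_transform(df)
```

The imputer runs before the scaler, so StandardScaler never sees a NaN; the same ordering applies to the categorical branch, where the imputer precedes OneHotEncoder.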

SimpleImputer in Pipeline

# TypeSelector is not part of scikit-learn; it is assumed here to be a
# custom transformer that selects DataFrame columns by dtype.
('features', FeatureUnion([
    ('Numerics', Pipeline([
        ('Numeric Extractor', TypeSelector(np.number)),
        ('Impute Zero', SimpleImputer(strategy="constant", fill_value=0))
    ])),
    ('Cat Columns', Pipeline([
        ('Category Extractor', TypeSelector("category")),
        ('Impute Missing', SimpleImputer(strategy="constant", fill_value='missing'))
    ]))
]))
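
For the pipeline in the question itself, here is a small end-to-end sketch on synthetic data (the data, the `logistic__C` search space, and the CV settings are assumptions made for illustration). On a recent scikit-learn it should fit without the NaN error, which suggests that SimpleImputer placed first in the Pipeline is enough when the only problem values are NaN:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features, binary target, NaNs in the last column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(200) < 0.1, 2] = np.nan  # knock out roughly 10% of one column

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=True)),
    ("logistic", LogisticRegression()),
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={"logistic__C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3,
    cv=3,
    random_state=0,
)
search.fit(X, y)  # imputation is re-fitted inside each CV fold
best_c = search.best_params_["logistic__C"]
```

If the error still appears with real data, one thing worth checking is whether any column contains infinite values (for example with `np.isinf`), since the error message also covers infinity and SimpleImputer with its default `missing_values=np.nan` does not treat `inf` as missing.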
answered Sep 25 '22 07:09 by hanzgs
