
How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value before vectorizing.

My initial idea was to create a Pipeline of SimpleImputer and CountVectorizer:

import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()

However, fit_transform raises an error because SimpleImputer outputs a 2D array, whereas CountVectorizer requires 1D input (an iterable of strings). Here's the error message:

AttributeError: 'numpy.ndarray' object has no attribute 'lower'
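
The shape mismatch can be inspected directly. The following is a minimal sketch, reusing the DataFrame from the question; the variable names `imputed` and `first_row` are only illustrative. It shows that the imputer returns a 2D array, so iterating over it yields one-element arrays rather than strings, which is why the call to `.lower()` inside CountVectorizer fails.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# SimpleImputer always returns a 2D array, even for a single column
imputed = SimpleImputer(strategy='constant').fit_transform(df[['text']])
print(imputed.shape)  # (3, 1)

# Iterating over a 2D array yields rows (1-element arrays), not strings;
# CountVectorizer would call .lower() on each of these rows
first_row = next(iter(imputed))
print(type(first_row))

# Flattening gives the 1D sequence of strings that CountVectorizer expects
print(imputed.ravel().shape)
```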

QUESTION: How can I modify this Pipeline so that it will work?

NOTE: I'm aware that I can impute missing values in pandas. However, I would like to accomplish all preprocessing in scikit-learn so that the same preprocessing can be applied to new data using Pipeline.

Asked by Kevin Markham, Jul 20 '20 17:07



2 Answers

The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.

Here's the complete code:

import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# CREATE TRANSFORMER
from sklearn.preprocessing import FunctionTransformer
one_dim = FunctionTransformer(np.reshape, kw_args={'newshape':-1})

# INCLUDE TRANSFORMER IN PIPELINE
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, one_dim, vect)

pipe.fit_transform(df[['text']]).toarray()

It has been proposed on GitHub that CountVectorizer should allow 2D input as long as the second dimension is 1 (meaning: a single column of data). That modification to CountVectorizer would be a great solution to this problem!
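
A possible variation on the code above, sketched under a few assumptions: it swaps `np.reshape` for `np.ravel`, which flattens without any keyword arguments (as far as I know, recent NumPy releases deprecate the `newshape` keyword of `np.reshape`, so this may be more robust). The `new_df` DataFrame is hypothetical and only there to show that the fitted Pipeline applies the same imputation and vocabulary to new data, which was the motivation stated in the question.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})
new_df = pd.DataFrame({'text': ['def ghi', np.nan]})  # hypothetical new data

pipe = make_pipeline(
    SimpleImputer(strategy='constant'),  # fills NaN with 'missing_value'
    FunctionTransformer(np.ravel),       # 2D (n, 1) -> 1D (n,)
    CountVectorizer(),
)

X_train = pipe.fit_transform(df[['text']]).toarray()

# transform() reuses the fitted imputer and the learned vocabulary
X_new = pipe.transform(new_df[['text']]).toarray()

vocab = sorted(pipe.named_steps['countvectorizer'].vocabulary_)
print(vocab)
print(X_new)
```

Note that the default fill value for text data, 'missing_value', becomes a token of its own in the vocabulary. Whether that is desirable depends on the use case.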

Answered by Kevin Markham, Oct 07 '22 21:10


One solution would be to subclass SimpleImputer and override its transform() method so that it returns a flattened 1D array:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer


class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        return super().transform(X).flatten()


df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

imp = ModifiedSimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()
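
One caveat with flattening inside transform(): if more than one column is passed in, the values from all columns are silently merged into one long 1D array. The sketch below adds a guard against that. The class name `SingleColumnImputer` and the error message are my own and not part of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class SingleColumnImputer(SimpleImputer):  # hypothetical name
    def transform(self, X):
        imputed = super().transform(X)
        # Flattening only makes sense when there is exactly one column
        if imputed.shape[1] != 1:
            raise ValueError(
                f"expected a single column, got {imputed.shape[1]}"
            )
        return imputed.ravel()


df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})
pipe = make_pipeline(SingleColumnImputer(strategy='constant'), CountVectorizer())
counts = pipe.fit_transform(df[['text']]).toarray()

# Passing two columns now fails loudly instead of mixing them together
two_cols = pd.DataFrame({'a': ['x y', np.nan], 'b': ['z w', 'u v']})
try:
    SingleColumnImputer(strategy='constant').fit_transform(two_cols)
    raised = False
except ValueError:
    raised = True
print(counts.shape, raised)
```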

Answered by Arash Khodadadi, Oct 07 '22 22:10