I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, so I would like to impute a constant value before vectorizing.
My initial idea was to create a Pipeline of SimpleImputer and CountVectorizer:
import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)
pipe.fit_transform(df[['text']]).toarray()
However, fit_transform raises an error because SimpleImputer outputs a 2D array and CountVectorizer requires 1D input. Here's the error message:
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
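To make the shape mismatch concrete, here is a minimal sketch that inspects what SimpleImputer hands to the next step. The explicit dtype=object and the variable names imputed and first_doc are my own additions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# dtype=object is an assumption here, to keep the column a plain object column
df = pd.DataFrame({'text': pd.Series(['abc def', 'abc ghi', np.nan], dtype=object)})

# SimpleImputer always returns a 2D array, even for a single column
imputed = SimpleImputer(strategy='constant').fit_transform(df[['text']])
print(imputed.shape)

# CountVectorizer iterates over its input and treats each item as one document.
# Iterating a 2D array yields 1D arrays, not strings, so .lower() is missing.
first_doc = next(iter(imputed))
print(type(first_doc), hasattr(first_doc, 'lower'))
```

Each "document" that reaches CountVectorizer is therefore a length-1 array rather than a string, which is where the AttributeError comes from.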
QUESTION: How can I modify this Pipeline so that it will work?
NOTE: I'm aware that I can impute missing values in pandas. However, I would like to accomplish all preprocessing in scikit-learn so that the same preprocessing can be applied to new data using Pipeline.
The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located. This class also allows for different missing values encodings.
Pipeline serves multiple purposes here: convenience and encapsulation (you only have to call fit and predict once on your data to fit a whole sequence of estimators) and joint parameter selection.
Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
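As a small illustration of the automatic naming described above, the sketch below builds the two-step pipeline from the question and lists the generated step names. The variable name step_names is mine.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(SimpleImputer(strategy='constant'), CountVectorizer())

# named_steps maps each auto-generated name to its estimator, in pipeline order
step_names = list(pipe.named_steps)
print(step_names)  # ['simpleimputer', 'countvectorizer']
```

These names are what you would use as prefixes when setting parameters, for example pipe.set_params(countvectorizer__lowercase=False).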
ColumnTransformer allows different columns or column subsets of the input to be transformed separately, and the features generated by each transformer will be concatenated to form a single feature space.
The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.
Here's the complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# CREATE TRANSFORMER
from sklearn.preprocessing import FunctionTransformer
one_dim = FunctionTransformer(np.reshape, kw_args={'newshape':-1})
# INCLUDE TRANSFORMER IN PIPELINE
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, one_dim, vect)
pipe.fit_transform(df[['text']]).toarray()
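A possible variation, offered as a sketch rather than a replacement: np.ravel performs the same 2D-to-1D flattening without any keyword arguments. That may be convenient because, as far as I know, newer NumPy releases deprecate the newshape keyword of np.reshape in favor of shape. The explicit dtype=object and the names counts and vocab are my own additions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# dtype=object is an assumption, to keep the column a plain object column
df = pd.DataFrame({'text': pd.Series(['abc def', 'abc ghi', np.nan], dtype=object)})

imp = SimpleImputer(strategy='constant')
vect = CountVectorizer()

# np.ravel turns the (n_samples, 1) array into a 1D array of n_samples strings
one_dim = FunctionTransformer(np.ravel)

pipe = make_pipeline(imp, one_dim, vect)
counts = pipe.fit_transform(df[['text']]).toarray()

# With no fill_value given, string data is imputed with "missing_value",
# which CountVectorizer then treats as an ordinary token
vocab = sorted(vect.vocabulary_)
print(vocab)
print(counts)
```

Note that the imputed placeholder becomes its own feature column. If that is not wanted, passing a different fill_value to SimpleImputer is one way to control it.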
It has been proposed on GitHub that CountVectorizer should allow 2D input as long as the second dimension is 1 (meaning: a single column of data). That modification to CountVectorizer would be a great solution to this problem!
One solution would be to subclass SimpleImputer and override its transform() method so that it returns a flattened (1D) array:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        return super().transform(X).flatten()
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})
imp = ModifiedSimpleImputer(strategy='constant')
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)
pipe.fit_transform(df[['text']]).toarray()
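Since the stated goal is to reuse the same preprocessing on new data, here is a short sketch of that step using the subclass approach. The DataFrame new_df, its contents, and the explicit dtype=object are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        # Flatten the (n_samples, 1) output so CountVectorizer receives 1D input
        return super().transform(X).flatten()

df = pd.DataFrame({'text': pd.Series(['abc def', 'abc ghi', np.nan], dtype=object)})
pipe = make_pipeline(ModifiedSimpleImputer(strategy='constant'), CountVectorizer())
pipe.fit(df[['text']])

# Hypothetical new data: transform() reuses the fitted imputer and vocabulary
new_df = pd.DataFrame({'text': pd.Series(['ghi abc', np.nan], dtype=object)})
new_counts = pipe.transform(new_df[['text']]).toarray()
print(new_counts)
```

The new rows are encoded against the vocabulary learned during fit, so the output has the same number of columns as the training matrix.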