How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value before vectorizing.

My initial idea was to create a Pipeline of SimpleImputer and CountVectorizer:

import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()

However, fit_transform raises an error because SimpleImputer outputs a 2D array, whereas CountVectorizer requires 1D input (an iterable of strings). Here's the error message:

AttributeError: 'numpy.ndarray' object has no attribute 'lower'
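
To see where the error comes from, here is a minimal sketch that inspects what SimpleImputer passes to the next step. It reuses the toy data from above; dtype=object is an assumption added only so the column dtype is the same across pandas versions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same toy data as above; dtype=object is an assumption made for this sketch
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]}, dtype=object)

imputed = SimpleImputer(strategy='constant').fit_transform(df[['text']])
print(imputed.shape)  # one row per document, one column

# Iterating over a 2D array yields 1D arrays (rows), not strings,
# so a vectorizer that calls .lower() on each "document" fails
first_doc = next(iter(imputed))
print(type(first_doc).__name__, hasattr(first_doc, 'lower'))
```

Each "document" that CountVectorizer would see is itself a NumPy array, which is why the traceback complains about a missing 'lower' attribute.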

QUESTION: How can I modify this Pipeline so that it will work?

NOTE: I'm aware that I can impute missing values in pandas. However, I would like to accomplish all preprocessing in scikit-learn so that the same preprocessing can be applied to new data using Pipeline.

asked Jul 20 '20 17:07

Kevin Markham

2 Answers

The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.

Here's the complete code:

import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

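# CREATE TRANSFORMER: flattens the imputer's (n_samples, 1) output to shape (n_samples,)
# NOTE: newer NumPy releases (2.1+) deprecate the 'newshape' keyword in favor of 'shape'
from sklearn.preprocessing import FunctionTransformer
one_dim = FunctionTransformer(np.reshape, kw_args={'newshape':-1})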

# INCLUDE TRANSFORMER IN PIPELINE
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, one_dim, vect)

pipe.fit_transform(df[['text']]).toarray()
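
As a variation on the reshaping step, np.ravel can stand in for np.reshape. This is only a sketch, not part of the original answer; it sidesteps the newshape keyword, which newer NumPy releases deprecate. The dtype=object argument is an assumption added so the example behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Same toy data as above; dtype=object is an assumption made for this sketch
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]}, dtype=object)

# np.ravel flattens the imputer's (n_samples, 1) output to (n_samples,)
one_dim = FunctionTransformer(np.ravel)

pipe = make_pipeline(SimpleImputer(strategy='constant'), one_dim, CountVectorizer())
dtm = pipe.fit_transform(df[['text']]).toarray()

# Learned vocabulary, in column order of the document-term matrix
vocab = sorted(pipe.named_steps['countvectorizer'].vocabulary_)
print(vocab)
print(dtm)
```

With the default fill value for string data, the missing row ends up as the single token 'missing_value' in the vocabulary.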

It has been proposed on GitHub that CountVectorizer should allow 2D input as long as the second dimension is 1 (meaning: a single column of data). That modification to CountVectorizer would be a great solution to this problem!

answered Oct 07 '22 21:10

Kevin Markham

One solution would be to subclass SimpleImputer and override its transform() method so that it returns a flattened (1D) array:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer


class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        return super().transform(X).flatten()


df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

imp = ModifiedSimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()
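
Since the stated goal in the question is to reuse the same preprocessing on new data, here is a short sketch of that with the subclass approach. The new_df frame and its contents are made up for illustration, and dtype=object is an assumption added so the example behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        # Flatten the (n_samples, 1) result so CountVectorizer gets 1D input
        return super().transform(X).flatten()


# Training data as above; dtype=object is an assumption made for this sketch
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]}, dtype=object)
pipe = make_pipeline(ModifiedSimpleImputer(strategy='constant'), CountVectorizer())
pipe.fit(df[['text']])

# Hypothetical new data; the missing value is imputed, then vectorized
new_df = pd.DataFrame({'text': ['ghi abc', np.nan]}, dtype=object)
new_dtm = pipe.transform(new_df[['text']]).toarray()
print(new_dtm)
```

The columns of new_dtm follow the vocabulary learned during fit, so the imputed row maps onto the same 'missing_value' column as in the training data.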

answered Oct 07 '22 22:10

Arash Khodadadi

Tags: python, machine-learning, imputation, scikit-learn, countvectorizer
