How can I get TFIDF Vectorizer to return a pandas dataframe with respective column names, inside an sklearn pipeline used for cross-validation?
I have an Sklearn Pipeline, where one of the steps is a TFIDF Vectorizer:
class InspectPipeline(BaseEstimator, TransformerMixin):
    def transform(self, x):
        return x

    def fit(self, x, y=None):
        self.df = x
        return self

pipeline = Pipeline(
    [
        ("selector", ItemSelector(key="text_column")),
        ("vectorizer", TfidfVectorizer()),
        ("debug", InspectPipeline()),
        ("classifier", RandomForestClassifier())
    ]
)
I created the InspectPipeline class so that I can later inspect the features passed to the classifier, by running pipeline.best_estimator_.named_steps['debug'].df. However, TfidfVectorizer returns a sparse matrix, so that is what this attribute holds. Instead of a sparse matrix, I would like to get the TF-IDF vectors as a pandas DataFrame whose column names are the respective TF-IDF tokens.
I know that tfidf_vectorizer.get_feature_names() (get_feature_names_out() in scikit-learn 1.0 and later) could give me the column names. But how can I combine this with converting the sparse matrix to a DataFrame, within the pipeline?
You can extend TfidfVectorizer to instead return a DataFrame with the desired column names, and use that in your pipeline.
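from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

class DenseTfidfVectorizer(TfidfVectorizer):
    # Note: get_feature_names_out() requires scikit-learn >= 1.0;
    # older releases used get_feature_names() instead.
    def transform(self, raw_documents):
        # Recent scikit-learn releases no longer accept a `copy` argument here.
        X = super().transform(raw_documents)
        # Densify the sparse matrix and label each column with its token.
        df = pd.DataFrame(X.toarray(), columns=self.get_feature_names_out())
        return df

    def fit_transform(self, raw_documents, y=None):
        X = super().fit_transform(raw_documents, y=y)
        df = pd.DataFrame(X.toarray(), columns=self.get_feature_names_out())
        return df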