
How to return a dataframe from sklearn TFIDF Vectorizer within pipeline?

How can I get TFIDF Vectorizer to return a pandas dataframe with respective column names, inside an sklearn pipeline used for cross-validation?

I have an Sklearn Pipeline, where one of the steps is a TFIDF Vectorizer:

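from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class InspectPipeline(BaseEstimator, TransformerMixin):
    # Pass-through step that keeps a reference to whatever it was fitted on.

    def transform(self, x):
        return x

    def fit(self, x, y=None):
        self.df = x
        return self


# ItemSelector is a custom transformer (defined elsewhere) that picks one column.
pipeline = Pipeline(
        [
         ("selector", ItemSelector(key="text_column")),
         ("vectorizer", TfidfVectorizer()),
         ("debug", InspectPipeline()),
         ("classifier", RandomForestClassifier())
        ]
)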

I created the InspectPipeline class so that I can later inspect which features were passed to the classifier (by running pipeline.best_estimator_.named_steps['debug'].df). However, TfidfVectorizer returns a sparse matrix, so that is what pipeline.best_estimator_.named_steps['debug'].df contains. Instead of a sparse matrix, I would like to get the TF-IDF vectors as a pandas DataFrame whose column names are the respective TF-IDF tokens.

I know that tfidf_vectorizer.get_feature_names() (get_feature_names_out() in scikit-learn 1.0 and later) gives the column names. But how can I combine this with converting the sparse matrix to a DataFrame, within the pipeline?

asked Mar 05 '23 06:03 by Glyph


1 Answer

You can extend TfidfVectorizer to instead return a DataFrame with the desired column names, and use that in your pipeline.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

class DenseTfidfVectorizer(TfidfVectorizer):
    """TfidfVectorizer that returns a dense DataFrame with one column per token."""

    def transform(self, raw_documents):
        # Recent scikit-learn versions no longer accept a `copy` argument here.
        X = super().transform(raw_documents)
        # get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0.
        df = pd.DataFrame(X.toarray(), columns=self.get_feature_names_out())
        return df

    def fit_transform(self, raw_documents, y=None):
        X = super().fit_transform(raw_documents, y=y)
        df = pd.DataFrame(X.toarray(), columns=self.get_feature_names_out())
        return df

answered Mar 08 '23 22:03 by swhat