How to use multiple input features with associated extractors in a pipeline?

Tags:

scikit-learn

I am working on a classification task with Scikit-learn. I have a data set in which each observation comprises two separate text fields. I want to set up a Pipeline in which each text field is passed in parallel through its own TfidfVectorizer and the outputs of the TfidfVectorizer objects are passed to a classifier. My aim is to be able to optimize the parameters of the two TfidfVectorizer objects along with those of the classifier, using GridSearchCV.

The Pipeline might be depicted as follows:

Text 1 -> TfidfVectorizer 1 --------|
                                    +---> Classifier
Text 2 -> TfidfVectorizer 2 --------|

I understand how to do this without using a Pipeline (by just creating two TfidfVectorizer objects and working from there), but how do I set this up inside a Pipeline?

Thanks for any help,

Rob.

Rob Goon asked Oct 19 '22 22:10


1 Answer

Use the Pipeline and FeatureUnion classes. The code for your case would look something like:

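from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# ExtractText1 and ExtractText2 are not part of scikit-learn. They stand for
# transformers you write yourself, each returning one text field per observation.
pipeline = Pipeline([
  # FeatureUnion runs both branches on the same input and concatenates their outputs
  ('features', FeatureUnion([
    ('c1', Pipeline([
      ('text1', ExtractText1()),
      ('tf_idf1', TfidfVectorizer())
    ])),
    ('c2', Pipeline([
      ('text2', ExtractText2()),
      ('tf_idf2', TfidfVectorizer())
    ]))
  ])),
  # The classifier sees the combined tf-idf features from both text fields
  ('classifier', MultinomialNB())
])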
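
The ExtractText1 and ExtractText2 steps are not defined in the snippet above. A minimal sketch of what they might look like follows, assuming each observation is a (text1, text2) pair; that input layout, the class bodies, and the sample data are illustrative assumptions rather than anything the answer specifies.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class ExtractText1(BaseEstimator, TransformerMixin):
    """Hypothetical extractor: returns the first text field of each observation."""

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        return [row[0] for row in X]


class ExtractText2(BaseEstimator, TransformerMixin):
    """Hypothetical extractor: returns the second text field of each observation."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [row[1] for row in X]


# Assumed input layout: one (text1, text2) pair per observation.
observations = [("first title", "first body"), ("second title", "second body")]
titles = ExtractText1().fit_transform(observations)
bodies = ExtractText2().fit_transform(observations)
```

Inheriting from BaseEstimator gives the classes get_params and set_params, which GridSearchCV needs in order to clone the pipeline for each parameter combination.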

You can run a grid search over the entire structure by naming each parameter with the <estimator1>__<estimator2>__<parameter> syntax, where the step names are joined by double underscores. For example, features__c1__tf_idf1__min_df refers to the min_df parameter of TfidfVectorizer 1 from your diagram.
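
As a rough illustration of that naming scheme, the sketch below runs GridSearchCV over one vectorizer parameter and one classifier parameter. The FieldExtractor class is a hypothetical stand-in for ExtractText1 and ExtractText2, and the toy data, grid values, and cv=2 setting are assumptions chosen only to keep the example small.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


class FieldExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for ExtractText1/ExtractText2: picks one field by index."""

    def __init__(self, index=0):
        self.index = index

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [row[self.index] for row in X]


pipeline = Pipeline([
    ('features', FeatureUnion([
        ('c1', Pipeline([('text1', FieldExtractor(index=0)),
                         ('tf_idf1', TfidfVectorizer())])),
        ('c2', Pipeline([('text2', FieldExtractor(index=1)),
                         ('tf_idf2', TfidfVectorizer())])),
    ])),
    ('classifier', MultinomialNB()),
])

# Keys follow the step names from the outermost pipeline down to the parameter.
param_grid = {
    'features__c1__tf_idf1__min_df': [1, 2],
    'classifier__alpha': [0.1, 1.0],
}

# Toy data (made up): each observation is a (text1, text2) pair.
X = [
    ("cheap pills offer", "buy now cheap"),
    ("cheap offer today", "buy cheap pills now"),
    ("offer cheap deal", "cheap deal buy"),
    ("cheap cheap offer", "buy now"),
    ("project meeting notes", "see agenda attached"),
    ("meeting project plan", "agenda for monday attached"),
    ("project meeting minutes", "attached agenda"),
    ("meeting about project", "agenda attached please"),
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

Parameters of the second vectorizer can be added to the grid in the same way, for example with a features__c2__tf_idf2__min_df key.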

Daniel answered Oct 22 '22 23:10