I am working on a classification task with Scikit-learn. I have a data set in which each observation comprises two separate text fields. I want to set up a Pipeline in which each text field is passed in parallel through its own TfidfVectorizer and the outputs of the TfidfVectorizer objects are passed to a classifier. My aim is to be able to optimize the parameters of the two TfidfVectorizer objects along with those of the classifier, using GridSearchCV.
The Pipeline might be depicted as follows:
Text 1 -> TfidfVectorizer 1 --------|
                                    +---> Classifier
Text 2 -> TfidfVectorizer 2 --------|
I understand how to do this without using a Pipeline (by just creating two TfidfVectorizer objects and working from there), but how do I set this up inside a Pipeline?
Thanks for any help,
Rob.
Use the Pipeline and FeatureUnion classes. The code for your case would look something like:
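from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# ExtractText1 and ExtractText2 are placeholders for your own transformers;
# each one should return a single text field from every observation.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('c1', Pipeline([
            ('text1', ExtractText1()),
            ('tf_idf1', TfidfVectorizer())
        ])),
        ('c2', Pipeline([
            ('text2', ExtractText2()),
            ('tf_idf2', TfidfVectorizer())
        ]))
    ])),
    ('classifier', MultinomialNB())
])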
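ExtractText1 and ExtractText2 are not part of scikit-learn, so you have to write them yourself. Here is a minimal sketch of one way to do it, assuming each observation is a dict with the keys "text1" and "text2". The class name FieldExtractor, the key names, and the sample data are all made up for illustration.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class FieldExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pick one text field out of each observation."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        # Return a list of strings, which TfidfVectorizer accepts as input.
        return [row[self.key] for row in X]


# Assumed data layout: a list of dicts, one dict per observation.
samples = [
    {"text1": "cheap pills", "text2": "buy now"},
    {"text1": "meeting notes", "text2": "see agenda"},
]
first_field = FieldExtractor("text1").fit_transform(samples)
second_field = FieldExtractor("text2").fit_transform(samples)
```

With a helper like this, FieldExtractor("text1") and FieldExtractor("text2") could stand in for ExtractText1() and ExtractText2() in the pipeline. If your data lives in a DataFrame or a 2-D array instead, the transform method would need to select a column rather than a dict key.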
You can do a grid search over the entire structure by referring to the parameters with the <estimator1>__<estimator2>__<parameter> syntax. For example, features__c1__tf_idf1__min_df refers to the min_df parameter of TfidfVectorizer 1 from your diagram.
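To make the double-underscore naming concrete, here is a rough end-to-end sketch of the grid search. It rests on several assumptions: the observations are dicts with the keys "text1" and "text2", the field extraction is done with FunctionTransformer wrapping two small hypothetical functions (swapped in for the ExtractText1/ExtractText2 placeholders), and the toy data and grid values are invented purely so the example runs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def get_text1(rows):
    # Hypothetical extractor for the first text field.
    return [row["text1"] for row in rows]


def get_text2(rows):
    # Hypothetical extractor for the second text field.
    return [row["text2"] for row in rows]


pipeline = Pipeline([
    ("features", FeatureUnion([
        ("c1", Pipeline([
            ("text1", FunctionTransformer(get_text1)),
            ("tf_idf1", TfidfVectorizer()),
        ])),
        ("c2", Pipeline([
            ("text2", FunctionTransformer(get_text2)),
            ("tf_idf2", TfidfVectorizer()),
        ])),
    ])),
    ("classifier", MultinomialNB()),
])

# Each key walks down the nesting: step name, then sub-step, then parameter.
param_grid = {
    "features__c1__tf_idf1__min_df": [1, 2],
    "features__c2__tf_idf2__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.1, 1.0],
}

# Invented toy data: four observations per class.
text1 = ["special offer today", "limited offer now", "offer ends soon",
         "exclusive offer inside", "team meeting notes",
         "meeting moved to monday", "agenda for meeting", "meeting room booked"]
text2 = ["click here to buy", "buy before friday", "click to buy",
         "buy one now", "see attached agenda", "agenda attached below",
         "please review agenda", "agenda and minutes"]
X = [{"text1": a, "text2": b} for a, b in zip(text1, text2)]
y = [1, 1, 1, 1, 0, 0, 0, 0]

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X, y)
best_params = search.best_params_
predictions = search.predict(X)
```

After fitting, search.best_params_ holds one value for each key in param_grid, and search.best_estimator_ is the whole pipeline refitted with those settings. Calling pipeline.get_params().keys() is a handy way to list every parameter name you are allowed to put in the grid.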