
scikit-learn FeatureUnion: grid search over subsets of features

How can I use a FeatureUnion in scikit-learn so that a grid search can treat its parts as optional?

The code below sets up a FeatureUnion that combines a TfidfVectorizer for words with a character-level vectorizer.

When running the grid search, in addition to testing the defined parameter space, I would also like to test 'vect__wordvect' alone with its ngram_range parameter (with the character vectorizer disabled), and 'vect__lettervect' alone with the lowercase parameter set to True and False (with the word vectorizer disabled).

How can this be done?

EDIT: Complete code example, based on maxymoo's suggestion:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search no longer exists in current releases
from sklearn.pipeline import Pipeline, FeatureUnion

# set up the FeatureUnion
wordvect = TfidfVectorizer(analyzer='word')
lettervect = CountVectorizer(analyzer='char')
featureunionvect = FeatureUnion([("lettervect", lettervect), ("wordvect", wordvect)])

# set up the pipeline
classifier = LogisticRegression(class_weight='balanced')
pipeline = Pipeline([('vect', featureunionvect), ('classifier', classifier)])

# grid search parameters
parameters = {
    'vect__wordvect__ngram_range': [(1, 1), (1, 2)],  # commenting out these two lines lets the search
    'vect__lettervect__lowercase': [True, False],     # run, but then nothing is parameterized anymore
    'vect__transformer_list': [[('wordvect', wordvect)],
                               [('lettervect', lettervect)],
                               [('wordvect', wordvect), ('lettervect', lettervect)]]}

# data
newsgroups_train = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'])

# grid search with cross-validation
gs_clf = GridSearchCV(pipeline, parameters)
gs_clf = gs_clf.fit(newsgroups_train.data, newsgroups_train.target)

# cv_results_ replaces the old grid_scores_ attribute
for params, mean_score in zip(gs_clf.cv_results_['params'],
                              gs_clf.cv_results_['mean_test_score']):
    print("gridsearch scores:", mean_score, params)
asked Oct 19 '22 10:10 by tkja



1 Answer

FeatureUnion has a parameter called transformer_weights that you can grid-search over: a weight of 0 zeroes out that transformer's features, which effectively switches it off. In your case the grid search parameters would become

parameters = {'vect__wordvect__ngram_range': [(1, 1), (1, 2)],
              'vect__lettervect__lowercase': [True, False],
              'vect__transformer_weights': [{"lettervect":1,"wordvect":0}, 
                                            {"lettervect":0,"wordvect":1}, 
                                            {"lettervect":1,"wordvect":1}]}
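
A possible alternative, sketched here and not part of the original answer: GridSearchCV also accepts a list of parameter dicts, and each dict is expanded as its own grid. Assuming a scikit-learn version in which a FeatureUnion transformer can be replaced by the string 'drop', each sub-grid can switch one vectorizer off and tune only the parameters of the one that remains, so no parameter ever refers to a missing transformer. The toy corpus and cv=2 below are made up so the example runs without downloading the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline

# Same pipeline layout as in the question.
pipeline = Pipeline([
    ('vect', FeatureUnion([
        ('lettervect', CountVectorizer(analyzer='char')),
        ('wordvect', TfidfVectorizer(analyzer='word')),
    ])),
    ('classifier', LogisticRegression(class_weight='balanced')),
])

ngram_ranges = [(1, 1), (1, 2)]
lowercase_options = [True, False]

# One dict per feature subset; each dict is searched as a separate grid.
param_grid = [
    # words only: the character vectorizer is dropped
    {'vect__lettervect': ['drop'],
     'vect__wordvect__ngram_range': ngram_ranges},
    # characters only: the word vectorizer is dropped
    {'vect__wordvect': ['drop'],
     'vect__lettervect__lowercase': lowercase_options},
    # both vectorizers active
    {'vect__wordvect__ngram_range': ngram_ranges,
     'vect__lettervect__lowercase': lowercase_options},
]

# Hypothetical toy data standing in for the two newsgroups categories.
docs = [
    "the rocket reached orbit around the moon",
    "astronauts repaired the space station",
    "the probe sent images of mars",
    "a new telescope observes distant galaxies",
    "the debate was about belief and faith",
    "he argued that gods do not exist",
    "the book discusses religion and doubt",
    "she wrote an essay on atheism and morals",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

gs_clf = GridSearchCV(pipeline, param_grid, cv=2)
gs_clf.fit(docs, labels)

# Report the mean cross-validated score of every candidate.
for params, mean_score in zip(gs_clf.cv_results_['params'],
                              gs_clf.cv_results_['mean_test_score']):
    print(mean_score, params)
```

Unlike a zero weight, a dropped transformer is not fitted at all, so the single-vectorizer candidates also skip the cost of computing the unused features.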
answered Oct 21 '22 05:10 by maxymoo
