I am using two estimators, RandomForest and SVM.
random_forest_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('random_forest', RandomForestClassifier())
])
svm_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('svm', LinearSVC())
])
I want to first vectorize the data and then use the estimator. I was going through this online tutorial. Then I set up the hyperparameters as follows:
parameters = [
    {
        'vectorizer__max_features': [500, 1000, 1500],
        'random_forest__min_samples_split': [50, 100, 250, 500]
    },
    {
        'vectorizer__max_features': [500, 1000, 1500],
        'svm__C': [1, 3, 5]
    }
]
and passed them to GridSearchCV:
pipelines = [random_forest_pipeline, svm_pipeline]
grid_search = GridSearchCV(pipelines, param_grid=parameters, cv=3, n_jobs=-1)
grid_search.fit(x_train, y_train)
But when I run the code, I get an error:
TypeError: estimator should be an estimator implementing 'fit' method
I don't know why I am getting this error.
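The error occurs because GridSearchCV expects a single estimator that implements fit, and a plain Python list of pipelines does not. A common workaround (a minimal sketch reusing the names from your question, not the approach shown further below) is to run a separate grid search for each pipeline and compare the results:

from sklearn.model_selection import GridSearchCV

# One GridSearchCV per pipeline, each paired with its own parameter grid
searches = []
for pipeline, params in zip([random_forest_pipeline, svm_pipeline], parameters):
    search = GridSearchCV(pipeline, param_grid=params, cv=3, n_jobs=-1)
    search.fit(x_train, y_train)  # x_train / y_train as defined in the question
    searches.append(search)

# Keep whichever pipeline achieved the best mean cross-validated score
best_search = max(searches, key=lambda s: s.best_score_)
print(best_search.best_params_, best_search.best_score_)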
However, it is also quite possible to do it in a single Pipeline/GridSearchCV, based on an example here. You just have to explicitly specify the scoring method for the pipeline, since we are not declaring the final estimator initially.
Example:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
import pandas as pd

# The final step is left as 'passthrough' so the actual classifier
# can be supplied through the parameter grid.
my_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('clf', 'passthrough')
])
parameters = [
    {
        'vectorizer__max_features': [500, 1000],
        'clf': [RandomForestClassifier()],
        'clf__min_samples_split': [50, 100]
    },
    {
        'vectorizer__max_features': [500, 1000],
        'clf': [LinearSVC()],
        'clf__C': [1, 3]
    }
]
grid_search = GridSearchCV(my_pipeline, param_grid=parameters, cv=3, n_jobs=-1, scoring='accuracy')
grid_search.fit(X, y)
grid_search.best_params_
# {'clf': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
# criterion='gini', max_depth=None, max_features='auto',
# max_leaf_nodes=None, max_samples=None,
# min_impurity_decrease=0.0, min_impurity_split=None,
# min_samples_leaf=1, min_samples_split=100,
# min_weight_fraction_leaf=0.0, n_estimators=100,
# n_jobs=None, oob_score=False, random_state=None,
# verbose=0, warm_start=False),
# 'clf__min_samples_split': 100,
# 'vectorizer__max_features': 1000}
pd.DataFrame(grid_search.cv_results_)[['param_vectorizer__max_features',
'param_clf__min_samples_split',
'param_clf__C','mean_test_score',
'rank_test_score']]
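Since refit=True by default, grid_search.best_estimator_ is the winning pipeline (vectorizer plus classifier) refit on the full training data, so it can be used directly for prediction. A minimal sketch, assuming a hypothetical held-out X_test of raw documents:

best_model = grid_search.best_estimator_  # full pipeline refit on all of X, y
predictions = best_model.predict(X_test)  # X_test is a hypothetical held-out set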