I was trying to use the scikit-learn package with Python 3.4 to do a grid search:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model.logistic import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.preprocessing import LabelBinarizer
import numpy as np
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression)
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10)
}
if __name__ == '__main__':
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    df = pd.read_csv('SMS Spam Collection/SMSSpamCollection', delimiter='\t', header=None)
    lb = LabelBinarizer()
    X, y = df[1], np.array([number[0] for number in lb.fit_transform(df[0])])
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search.fit(X_train, y_train)
    print('Best score: ', grid_search.best_score_)
    print('Best parameter set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(best_parameters):
        print(param_name, best_parameters[param_name])
However, it does not run successfully. The error message looks like this:
Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
Traceback (most recent call last):
  File "/home/xiangru/PycharmProjects/machine_learning_note_with_sklearn/grid search.py", line 36, in <module>
    grid_search.fit(X_train, y_train)
  File "/usr/local/lib/python3.4/dist-packages/sklearn/grid_search.py", line 732, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/usr/local/lib/python3.4/dist-packages/sklearn/grid_search.py", line 493, in _fit
    base_estimator = clone(self.estimator)
  File "/usr/local/lib/python3.4/dist-packages/sklearn/base.py", line 47, in clone
    new_object_params[name] = clone(param, safe=False)
  File "/usr/local/lib/python3.4/dist-packages/sklearn/base.py", line 35, in clone
    return estimator_type([clone(e, safe=safe) for e in estimator])
  File "/usr/local/lib/python3.4/dist-packages/sklearn/base.py", line 35, in <listcomp>
    return estimator_type([clone(e, safe=safe) for e in estimator])
  File "/usr/local/lib/python3.4/dist-packages/sklearn/base.py", line 35, in clone
    return estimator_type([clone(e, safe=safe) for e in estimator])
  File "/usr/local/lib/python3.4/dist-packages/sklearn/base.py", line 35, in <listcomp>
    return estimator_type([clone(e, safe=safe) for e in estimator])
  File "/usr/local/lib/python3.4/dist-packages/sklearn/base.py", line 45, in clone
    new_object_params = estimator.get_params(deep=False)
TypeError: get_params() missing 1 required positional argument: 'self'
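As a sanity check on the first line of that output, the candidate count appears to be just the product of the number of options per parameter, and the number of fits is that count times the number of folds. A minimal sketch using only the standard library (the names `n_candidates` and `n_fits` are mine for illustration, not scikit-learn attributes):

```python
from math import prod

# Same grid as in the question above.
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
cv = 3  # number of folds passed to GridSearchCV

# One candidate per combination of options; one fit per candidate per fold.
n_candidates = prod(len(options) for options in parameters.values())
n_fits = n_candidates * cv

print(n_candidates, n_fits)  # 1536 4608
```

So the grid itself was being enumerated as intended; the failure happens later, when the estimator is cloned.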
I also tried running only:
if __name__ == '__main__':
    pipeline.get_params()
It gives the same error message. Does anyone know how to fix this?
The "missing 1 required positional argument: 'self'" error is almost always misleading, and actually means that you're calling an instance method on the class, rather than the instance (like calling dict.keys() instead of d.keys() on a dict named d).*
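To make that concrete, here is a small sketch with a user-defined class. `Estimator` is a hypothetical stand-in written for this example, not a scikit-learn class; it only has a `get_params` method so the call shapes match the traceback.

```python
class Estimator:
    """Hypothetical stand-in for an estimator; not a scikit-learn class."""

    def __init__(self, C=1.0):
        self.C = C

    def get_params(self, deep=True):
        return {"C": self.C}


# Called on an instance: `self` is bound automatically.
instance_params = Estimator().get_params()

# Called on the class itself: nothing is bound to `self`,
# so Python reports it as a missing positional argument.
try:
    Estimator.get_params()
except TypeError as exc:
    class_error = str(exc)

print(instance_params)
print(class_error)
```

The exact wording of the message varies a little between Python versions (newer ones prefix the class name), but it should still end with "missing 1 required positional argument: 'self'".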
And that's exactly what's going on here. The docs imply that the best_estimator_ attribute, like the estimator parameter to the initializer, is not an estimator instance, it's an estimator type, and "A object of that type is instantiated for each grid point."
So, if you want to call methods, you have to construct an object of that type, for some particular grid point.
However, from a quick glance at the docs, if you're trying to get the params that were used for the particular instance of the best estimator that returned the best score, isn't that just going to be best_params_? (I apologize that this part is a bit of a guess…)
For the Pipeline call, you definitely have an instance there. And the only documentation for that method is a param spec which shows that it takes one optional argument, deep. But under the covers, it's probably forwarding the get_params() call to one of its attributes. And with ('clf', LogisticRegression), it looks like you're constructing it with the class LogisticRegression, rather than an instance of that class, so if that's what it ends up forwarding to, that would explain the problem.
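One way to catch this kind of mistake before fitting is to scan the steps for anything that is still a class. The sketch below is only an illustration: `find_class_steps`, `Vectorizer` and `Classifier` are hypothetical names made up here, and the step lists merely mimic the `(name, estimator)` shape passed to Pipeline.

```python
import inspect


def find_class_steps(steps):
    """Return the names of steps whose estimator is a class, not an instance."""
    return [name for name, estimator in steps if inspect.isclass(estimator)]


# Hypothetical stand-ins for the real vectorizer and classifier.
class Vectorizer:
    pass


class Classifier:
    pass


# Same shape as the steps in the question: the second entry forgot the ().
broken_steps = [("vect", Vectorizer()), ("clf", Classifier)]
fixed_steps = [("vect", Vectorizer()), ("clf", Classifier())]

print(find_class_steps(broken_steps))  # ['clf']
print(find_class_steps(fixed_steps))   # []
```

Any name it returns points at a step that needs `()` after the class.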
* The reason the error says "missing 1 required positional argument: 'self'" instead of "must be called on an instance" or something is that in Python, d.keys() is effectively turned into dict.keys(d), and it's perfectly legal (and sometimes useful) to call it that way explicitly, so Python can't really tell you that dict.keys() is illegal, just that it's missing the self argument.
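The equivalence in that footnote is easy to check directly. This short sketch compares the two call styles on an ordinary dict (the variable names are just for illustration):

```python
d = {"a": 1, "b": 2}

# Usual form: the instance is passed as `self` implicitly.
via_instance = list(d.keys())

# Explicit form: call the method on the class and pass the instance yourself.
via_class = list(dict.keys(d))

print(via_instance == via_class)  # True
```

Note that for built-in types such as dict, calling `dict.keys()` with no argument produces a differently worded TypeError than a user-defined class does, but the cause is the same.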
I finally got the problem solved. The reason is exactly what abarnert said.
First I tried:
pipeline = LogisticRegression()
parameters = {
    'penalty': ('l1', 'l2'),
    'C': (0.01, 0.1, 1, 10)
}
and it worked well.
With that intuition I modified the pipeline to be:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
Note that there is a () after LogisticRegression.
This time it works.