
sklearn.ensemble.AdaBoostClassifier cannot accept SVM as base_estimator?

I am doing a text classification task. I want to use ensemble.AdaBoostClassifier with LinearSVC as the base_estimator. However, when I try to run the following code:

clf = AdaBoostClassifier(svm.LinearSVC(), n_estimators=50, learning_rate=1.0, algorithm='SAMME.R')
clf.fit(X, y)

An error occurred:

TypeError: AdaBoostClassifier with algorithm='SAMME.R' requires that the weak learner supports the calculation of class probabilities with a predict_proba method

The first question is: can svm.LinearSVC() not calculate class probabilities? How can I make it calculate them?

Then I changed the algorithm parameter and ran the code again:

clf = AdaBoostClassifier(svm.LinearSVC(), n_estimators=50, learning_rate=1.0, algorithm='SAMME')
clf.fit(X, y)

This time a different error happens: TypeError: fit() got an unexpected keyword argument 'sample_weight'. The AdaBoostClassifier documentation says of sample weights: "If None, the sample weights are initialized to 1 / n_samples." But even when I assign an integer to n_samples, the error still occurs.

The second question is: what does n_samples mean, and how can I solve this problem?

I hope someone can help me.

Following @jme's comment, I then tried

clf = AdaBoostClassifier(svm.SVC(kernel='linear', probability=True), n_estimators=10, learning_rate=1.0, algorithm='SAMME.R')
clf.fit(X, y)

The program never produces a result, and the memory used on the server stays unchanged.

The third question is: how can I make AdaBoostClassifier work with SVC as the base_estimator?

asked Nov 24 '14 by allenwang

People also ask

Can we use AdaBoost for regression?

AdaBoost algorithms can be used for both classification and regression problems.

Can you use AdaBoost with random forest?

Models trained using either Random Forest or the AdaBoost classifier make predictions that generalize better to a larger population. Models trained with both algorithms are less susceptible to overfitting / high variance.

How can I improve my AdaBoost?

Explore the number of trees. An important hyperparameter for AdaBoost is n_estimators. Often, by changing the number of base models (weak learners), we can adjust the accuracy of the model. The number of trees added to the model must be high for the model to work well, often hundreds, if not thousands.
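As a rough sketch of that tuning advice (using synthetic stand-in data from sklearn.datasets.make_classification, since no dataset is given here), one might cross-validate a few values of n_estimators and watch where accuracy plateaus:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; substitute your own X, y.
X, y = make_classification(n_samples=300, random_state=0)

# Cross-validate a few settings of n_estimators (default base
# estimator is a decision stump) to see where accuracy levels off.
for n in (10, 50, 100):
    clf = AdaBoostClassifier(n_estimators=n, random_state=0)
    score = cross_val_score(clf, X, y, cv=3).mean()
    print(n, round(score, 3))
```

In practice the useful range depends heavily on the data and the base estimator, so treat the sweep values above as placeholders.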


1 Answer

The right answer will depend on exactly what you're looking for. LinearSVC cannot predict class probabilities (required by the default 'SAMME.R' algorithm used by AdaBoostClassifier) and does not support sample_weight.
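If you want to check programmatically whether an estimator could work as a SAMME weak learner, one rough way (an illustrative sketch, not part of the original answer) is to inspect its fit signature for a sample_weight parameter; note that newer scikit-learn releases have added sample_weight support to estimators that lacked it when this question was asked:

```python
import inspect

from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier


def supports_sample_weight(estimator):
    # AdaBoost's SAMME algorithm reweights samples each round,
    # so the base estimator's fit() must accept sample_weight.
    return "sample_weight" in inspect.signature(estimator.fit).parameters


for est in (LinearSVC(), SVC(), DecisionTreeClassifier()):
    print(type(est).__name__, supports_sample_weight(est))
```

Whether LinearSVC prints True here depends on your scikit-learn version.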

You should be aware that a Support Vector Machine does not natively predict class probabilities. They are computed using Platt scaling (or an extension of Platt scaling in the multi-class case), a technique which has known issues. If you need less "artificial" class probabilities, an SVM might not be the way to go.
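To make that concrete, here is a small sketch (with make_classification stand-in data, since the question's X, y aren't shown) that prints the raw SVM margins next to the Platt-scaled probabilities that probability=True produces. The internal cross-validation used to fit the Platt scaler is also why training with probability=True is noticeably slower:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Stand-in data; substitute your own X, y.
X, y = make_classification(n_samples=200, random_state=0)

# probability=True turns on Platt scaling, fitted via an internal
# cross-validation on top of the usual SVM fit.
clf = SVC(kernel='linear', probability=True, random_state=0).fit(X, y)

print(clf.decision_function(X[:3]))  # raw SVM margins
print(clf.predict_proba(X[:3]))      # Platt-scaled probabilities
```

One known quirk: because the probabilities come from a separate model fitted on top of the margins, predict_proba can occasionally disagree with predict.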

With that said, I believe the most satisfying answer given your question would be that given by Graham. That is,

from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(SVC(probability=True, kernel='linear'), ...)

You have other options. You can use SGDClassifier with a hinge loss function and set AdaBoostClassifier to use the SAMME algorithm (which does not require a predict_proba function, but does require support for sample_weight):

from sklearn.linear_model import SGDClassifier

clf = AdaBoostClassifier(SGDClassifier(loss='hinge'), algorithm='SAMME', ...)

Perhaps the best answer would be to use a classifier that has native support for class probabilities, like Logistic Regression, if you want to use the default algorithm provided for AdaBoostClassifier. You can do this using sklearn.linear_model.LogisticRegression, or using SGDClassifier with a log loss function, as in the code provided by Kris.
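For example, a minimal sketch (again with stand-in data from make_classification) using LogisticRegression as the base estimator, which natively supports both predict_proba and sample_weight:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# Stand-in data; substitute your own X, y.
X, y = make_classification(n_samples=200, random_state=0)

# LogisticRegression supports both predict_proba and sample_weight,
# so it works with AdaBoostClassifier without any workaround.
clf = AdaBoostClassifier(LogisticRegression(max_iter=1000), n_estimators=10)
clf.fit(X, y)
print(clf.score(X, y))
```

The base estimator is passed positionally here so the snippet works across scikit-learn versions that renamed the base_estimator parameter to estimator.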

Hope that helps. If you're curious about what Platt scaling is, check out the original paper by John Platt here.

answered Sep 22 '22 by kevin