Although it wasn't originally conceived this way, the standard AdaBoost algorithm is equivalent to forward stagewise additive model estimation using an exponential loss function. That is, given some weak classifiers c_1,...,c_M and sample points x_1,...,x_N, the weights w_1,...,w_M coming out of the algorithm define the additive expansion

F_M(x) = w_1*c_1(x) + ... + w_M*c_M(x).

The strong classifier is the output, F_M(x). The loss function under which fitting this expansion stagewise reproduces the AdaBoost output is
L(y,f(x)) = exp(-y*f(x))
for classifiers taking values in {-1,1}. This is all explained in Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Section 10.4.
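For reference, the core of the argument in Section 10.4 can be written in two lines (sketched in LaTeX; the w_m above play the role of the beta_m here):

```latex
% Stagewise step m (ESL Sec. 10.4): add one scaled weak classifier to the
% current fit F_{m-1} by minimizing the exponential loss.
\[
(\beta_m, c_m) \;=\; \arg\min_{\beta,\,c} \sum_{i=1}^{N}
   \exp\!\bigl(-y_i\,[\,F_{m-1}(x_i) + \beta\, c(x_i)\,]\bigr)
\]
% Writing w_i^{(m)} = \exp(-y_i F_{m-1}(x_i)), which is fixed at step m,
% the solution is
\[
c_m = \arg\min_{c} \sum_{i=1}^{N} w_i^{(m)}\,
        \mathbf{1}\{\,y_i \neq c(x_i)\,\},
\qquad
\beta_m = \frac{1}{2}\,\log\frac{1-\mathrm{err}_m}{\mathrm{err}_m},
\]
% where err_m is the weighted misclassification rate of c_m.  These are the
% AdaBoost updates, up to the factor of 2 in the classifier weight.
```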
My question has to do with forward stagewise regression. It is a greedy algorithm: once w_i is estimated, it is fixed, then the next weight w_{i+1} is found, and so on. This seems like it is really designed to handle "independent" weak classifiers, like stump classifiers or tree classifiers restricted to mutually exclusive independent variables (features), so that the residual left after fitting a classifier cannot be explained any further by that classifier.
In other words, to fit a set of functions to a given target function, I wouldn't fit the first one, fix that coefficient, then find the optimal coefficient for the second while holding the first constant, and so on... unless I knew that the functions were independent. But this is, more or less, what the algorithm does.
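To make the greedy aspect concrete, here is a minimal sketch of forward stagewise fitting with squared loss (my own illustration, not the book's algorithm; the function name and the matrix-of-predictions setup are just for exposition):

```python
import numpy as np

def forward_stagewise(basis_preds, y, n_steps):
    """Greedy forward stagewise fit under squared loss.

    basis_preds: (N, M) array whose columns are the outputs of M fixed weak
    learners on the N sample points.  At each step the single column (with its
    best scalar weight) that most reduces the residual sum of squares is
    chosen, and that weight is then frozen.
    """
    N, M = basis_preds.shape
    weights = np.zeros(M)
    residual = np.asarray(y, dtype=float).copy()
    for _ in range(n_steps):
        best_j, best_w, best_drop = None, 0.0, 0.0
        for j in range(M):
            h = basis_preds[:, j]
            w = (h @ residual) / (h @ h)           # best weight for this column alone
            drop = (h @ residual) ** 2 / (h @ h)   # resulting drop in squared error
            if drop > best_drop:
                best_j, best_w, best_drop = j, w, drop
        if best_j is None:                         # nothing improves the fit
            break
        weights[best_j] += best_w                  # freeze (accumulate) the coefficient
        residual -= best_w * basis_preds[:, best_j]
    return weights
```

If the columns happen to be orthogonal, this greedy procedure recovers the joint least-squares fit; when they are correlated, the frozen coefficients can end up far from it, which is exactly the concern above.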
Does this explain the success of Adaboost with stump learners or decision trees compared to (from my experience) Adaboost with more comprehensive classifiers like SVM, or a linear model? Can someone give a reference? - I have not seen this aspect discussed in the literature. Thanks.
I think you may be a bit confused, or using terminology I'm not familiar with. Nothing in AdaBoost, or in stagewise additive models more generally, is independent or mutually exclusive, nor were they designed to be.
Does this explain the success of Adaboost with stump learners or decision trees compared to (from my experience) Adaboost with more comprehensive classifiers like SVM, or a linear model?
No. Methods that produce an ensemble of classifiers can be powerful, and their power comes mostly from their ability to reduce the error caused by the variance of the base model. AdaBoost and others can also reduce bias, but it is much easier to reduce variance-induced error.
For this reason we use decision trees: we can control the bias/variance trade-off of the tree by altering its maximum depth. This makes life easy, but they are not the be-all and end-all of boosting (for example, boosting in a high-dimensional space is quite difficult, and trees perform horribly in such situations).
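A minimal sketch of that depth knob in practice, assuming scikit-learn (the data set and depths are arbitrary; in older scikit-learn versions the argument is base_estimator rather than estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Deeper base trees have lower bias but higher variance; stumps (depth 1)
# start out with high bias.  The maximum depth is the knob being described.
for depth in (1, 2, 4):
    ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=depth),
        n_estimators=200,
        random_state=0,
    )
    print(f"max_depth={depth}: CV accuracy ~ {cross_val_score(ada, X, y, cv=5).mean():.3f}")
```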
We don't usually use linear models in boosting because they simply aren't very good at it. Without much thought we can produce "easy" data sets that will not be boosted well by linear models (consider one ring inside another, with each class of equal size: any linear base learner that cuts the inner ring in half necessarily cuts the outer ring in half too). A lowly decision stump is often better simply because it has a non-linearity in it, allowing for much faster adaptation to the data.
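The ring example is easy to reproduce; a rough sketch, again assuming scikit-learn, with make_circles standing in for the one-ring-inside-another data:

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression   # a linear base learner
from sklearn.tree import DecisionTreeClassifier        # a decision stump

# Two concentric rings, one class each: no single linear cut helps much.
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=0)

for name, base in [("linear base", LogisticRegression()),
                   ("stump base", DecisionTreeClassifier(max_depth=1))]:
    ada = AdaBoostClassifier(estimator=base, n_estimators=100, random_state=0)
    print(name, ada.fit(X, y).score(X, y))
```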
We avoid complex models such as SVMs because they take a long time to train. Regardless of which base model you choose, AdaBoost is going to run towards the same type of solution (it tries to maximize the L1 margin, whereas SVMs maximize the L2 margin). If you have to boost 1000 trees or 500 SVMs, it's probably going to be a lot faster to boost the trees. That doesn't even get into the parameter search you would have to do for each SVM added to the model; it is simply too time consuming. However, there are cases where this can work well - here is a face detection case.
There is also the issue of prediction time. If you need to boost 100 or 1000 models, it's going to increase the prediction time by 2 or 3 orders of magnitude. SVMs are already not the fastest predictors, and this only makes the situation worse.
The details of this are picked up more from the math than discussed in English. If you are interested in a more explicit discussion of why such models work, read some of Leo Breiman's papers.
The idea behind AdaBoost is that each tree, or more generally each weak learner, is trained individually, in a fixed order, with a weight attached to its output. The first classifier is trained on the original data and a weight for it is calculated. The second classifier is then trained with the elements the first classifier got right weighted lower and the elements it misclassified weighted higher, and a weight is learned for the second classifier. The third classifier is trained with the points the combined first and second classifiers got right weighted lower and the points they got wrong weighted higher, and so on and so forth.
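A bare-bones sketch of that reweighting loop, assuming scikit-learn only for the stump (the names adaboost_fit and adaboost_predict are just for this illustration; labels are taken to be in {-1, +1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Bare-bones discrete AdaBoost with stumps; labels y must be in {-1, +1}.

    Each round: fit a stump to the weighted sample, compute its classifier
    weight from its weighted error, then up-weight the points it got wrong
    and down-weight the points it got right.
    """
    y = np.asarray(y)
    n = len(y)
    sample_w = np.full(n, 1.0 / n)                 # start uniform
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=sample_w)
        pred = stump.predict(X)
        err = np.clip(sample_w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)      # classifier weight (the w_m of the question)
        sample_w *= np.exp(-alpha * y * pred)      # misclassified points grow heavier
        sample_w /= sample_w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Sign of the weighted vote F_M(x) = sum_m alpha_m * c_m(x)."""
    F = sum(a * s.predict(X) for a, s in zip(stumps, alphas))
    return np.sign(F)
```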