After using OpenCV for boosting I'm trying to implement my own version of the Adaboost
algorithm (check here, here and the original paper for some references).
By reading all the material I've come up with some questions regarding the implementation of the algorithm.
1) It is not clear to me how the weights a_t of each weak learner are assigned.
In all the sources I've pointed to, the choice is a_t = k * ln( (1-e_t) / e_t ), with k a positive constant and e_t the error rate of the particular weak learner.
On page 7 of this source it says that this particular value minimizes a certain convex differentiable function, but I really don't understand the passage.
Can anyone please explain it to me?
2) I have some doubts about the procedure for updating the weights of the training samples.
Clearly it should be done in such a way as to guarantee that they remain a probability distribution. All the references adopt this choice:
D_{t+1}(i) = D_t(i) * e^(-a_t * y_i * h_t(x_i)) / Z_t (where Z_t is a normalization factor chosen so that D_{t+1} is a distribution).
But why is this particular weight update, multiplicative with the exponential of the error made by the particular weak learner, the one chosen? Are there any other updates possible? And if yes, is there a proof that this update guarantees some kind of optimality of the learning process?
I hope this is the right place to post this question; if not, please redirect me!
Thanks in advance for any help you can provide.
1) Your first question:
a_t = k * ln( (1-e_t) / e_t )
Since the error on the training data is bounded by the product of the normalization factors Z_t(alpha), and Z_t(alpha) = (1 - e_t) * e^(-alpha) + e_t * e^(alpha) is convex w.r.t. alpha, there is a single "global" optimal alpha which minimizes that upper bound on the error. Setting the derivative to zero gives e^(2 * alpha) = (1 - e_t) / e_t, i.e. alpha = (1/2) * ln( (1-e_t) / e_t ), which is exactly your formula with k = 1/2. This is the intuition of how you find the magic alpha.
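If you want to convince yourself numerically, here is a quick sanity check in Python (the value of e_t is an arbitrary illustration): minimizing Z_t(alpha) numerically lands on the same alpha as the closed form with k = 1/2.

```python
import numpy as np
from scipy.optimize import minimize_scalar

e_t = 0.3  # illustrative weighted error rate of a weak learner, in (0, 0.5)

# Z_t(alpha) = (1 - e_t) * exp(-alpha) + e_t * exp(alpha)
def Z(alpha):
    return (1 - e_t) * np.exp(-alpha) + e_t * np.exp(alpha)

numeric = minimize_scalar(Z).x               # numerical minimizer of the convex Z
closed_form = 0.5 * np.log((1 - e_t) / e_t)  # alpha = (1/2) * ln((1-e_t)/e_t)

print(numeric, closed_form)                  # both ~= 0.4236
```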
2) Your second question: But why is the particular choice of weight update multiplicative with the exponential of the error made by the particular weak learner?
To cut it short: the above choice of alpha indeed improves the accuracy. This is not surprising: you are trusting more (by giving a larger alpha) the learners that work better than the others, and trusting less (by giving a smaller alpha) those that work worse. For learners bringing no new knowledge beyond the previous ones (i.e. with error rate e_t = 1/2 on the current distribution), you assign a weight alpha equal to 0.
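To make the update concrete, here is one round of the re-weighting on a toy example (the labels, predictions and uniform starting distribution below are made up for illustration). Misclassified samples gain mass, correctly classified ones lose it:

```python
import numpy as np

# One round of the AdaBoost sample-weight update:
# D_{t+1}(i) = D_t(i) * exp(-a_t * y_i * h_t(x_i)) / Z_t
y   = np.array([+1, +1, -1, -1, +1])   # true labels in {-1, +1}
h_x = np.array([+1, -1, -1, +1, +1])   # weak learner predictions
D   = np.full(5, 1 / 5)                # current (uniform) distribution

e_t = D[y != h_x].sum()                # weighted error of the weak learner
a_t = 0.5 * np.log((1 - e_t) / e_t)    # its weight alpha

D_next = D * np.exp(-a_t * y * h_x)    # up-weight mistakes, down-weight correct hits
D_next /= D_next.sum()                 # dividing by Z_t makes it a distribution again

print(D_next)                          # misclassified samples now carry more mass
```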
It is possible to prove (see, for example, the original paper mentioned in your question) that the final boosted hypothesis has training error bounded by
exp( -2 * sum_t (1/2 - e_t)^2 )
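You can also check this bound numerically: at the optimal alpha each round contributes Z_t = 2 * sqrt(e_t * (1 - e_t)), and the product of these factors stays below the exponential bound (the per-round error rates below are made up):

```python
import numpy as np

e = np.array([0.3, 0.45, 0.4, 0.35])       # illustrative per-round error rates e_t
Z = 2 * np.sqrt(e * (1 - e))               # Z_t at the optimal alpha
print(np.prod(Z))                          # bound on training error: ~0.85
print(np.exp(-2 * np.sum((0.5 - e)**2)))   # exponential bound: ~0.86
```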
3) Your third question: Are there any other updates possible? And if yes, is there a proof that this update guarantees some kind of optimality of the learning process?
This is hard to say. But remember that the update improves the accuracy on the training data (at the risk of over-fitting); it is hard to say anything about how well it generalizes.
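Putting the pieces together, a minimal sketch of the whole algorithm might look like this. Decision stumps via scikit-learn are my choice of weak learner here, and the dataset is synthetic; treat it as an illustration of the update rules above, not a reference implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Minimal AdaBoost with decision stumps; expects labels in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1 / n)                        # initial uniform distribution
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=D)             # train on the current distribution
        pred = h.predict(X)
        e = np.clip(D[pred != y].sum(), 1e-10, 1)  # weighted error e_t
        if e >= 0.5:                             # no better than chance: stop
            break
        a = 0.5 * np.log((1 - e) / e)            # learner weight a_t
        D *= np.exp(-a * y * pred)               # multiplicative re-weighting
        D /= D.sum()                             # divide by Z_t
        learners.append(h)
        alphas.append(a)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # Final hypothesis: sign of the alpha-weighted vote sum_t a_t * h_t(x)
    votes = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(votes)

X, y = make_classification(n_samples=200, random_state=0)
y = 2 * y - 1                                    # map {0, 1} labels to {-1, +1}
learners, alphas = adaboost_fit(X, y)
print((adaboost_predict(X, learners, alphas) == y).mean())  # training accuracy
```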