What is out of bag error in Random Forests? Is it the optimal parameter for finding the right number of trees in a Random Forest?
The out-of-bag (OOB) error is the average error for each training sample, calculated using predictions only from the trees that did not contain that sample in their respective bootstrap sample. This allows the RandomForestClassifier to be fit and validated while being trained [1].
Most of the features have shown negligible importance: the mean is about 5%, a third of them have importance 0, and a third have importance above the mean. Perhaps the most striking fact, however, is the OOB (out-of-bag) score: a bit less than 1%.
The OOB estimate of error rate is a useful measure to discriminate between different random forest classifiers. We could, for instance, vary the number of trees or the number of variables to be considered, and select the combination that produces the smallest value for this error rate.
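For example, in scikit-learn the OOB estimate can be used to compare different numbers of trees directly. The sketch below assumes a synthetic dataset and an illustrative grid of tree counts; `oob_score_` is the attribute scikit-learn exposes when `oob_score=True` is set:

```python
# Sketch: comparing OOB error across different numbers of trees.
# The dataset and the tree-count grid are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for n_trees in (25, 50, 100, 200):
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,    # enable OOB estimation while fitting
        bootstrap=True,    # required for OOB (the default)
        random_state=0,
    )
    rf.fit(X, y)
    # oob_score_ is an accuracy, so the OOB error rate is 1 - oob_score_
    print(n_trees, 1.0 - rf.oob_score_)
```

You would then pick the combination of parameters with the smallest OOB error, with no separate validation split needed.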
Similarly, each OOB sample row is passed through every decision tree that did not contain it in its bootstrap training data, and a majority prediction is noted for each row. Finally, the OOB score is computed as the proportion of correctly predicted rows in the out-of-bag sample.
I will make an attempt to explain:
Suppose our training data set is represented by T and suppose data set has M features (or attributes or variables).
T = {(X1,y1), (X2,y2), ... (Xn, yn)}
and
Xi is an input vector {xi1, xi2, ... xiM} and yi is the label (or output, or class).
Summary of RF:
The Random Forests algorithm is a classifier based primarily on two methods: bagging and the random subspace method.
Suppose we decide to have S trees in our forest. We first create S datasets of the same size as the original, produced by random resampling of the data in T with replacement (n draws for each dataset). This yields datasets {T1, T2, ... TS}, each called a bootstrap dataset. Because of the "with replacement" sampling, every dataset Ti can contain duplicate data records, and Ti can be missing several records from the original dataset. This is called bootstrapping (en.wikipedia.org/wiki/Bootstrapping_(statistics)).
Bagging is the process of taking bootstrap samples and then aggregating the models learned on each of them.
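The resampling step can be sketched in a few lines. Names follow the text (T, S, n); the values are illustrative. Note how each bootstrap dataset has duplicates and, as a consequence, leaves some records out-of-bag:

```python
# Sketch: drawing S bootstrap datasets from T by sampling n rows with
# replacement. T is a stand-in for the training records.
import numpy as np

rng = np.random.default_rng(0)
n = 10
T = np.arange(n)             # record indices of the original dataset
S = 3                        # number of trees / bootstrap datasets

bootstraps = [rng.choice(T, size=n, replace=True) for _ in range(S)]
for i, Ti in enumerate(bootstraps, 1):
    # duplicates appear, and the missing records are out-of-bag for tree i
    oob = set(T) - set(Ti.tolist())
    print(f"T{i}: {sorted(Ti.tolist())}  OOB: {sorted(oob)}")
```

On average each bootstrap dataset leaves out about 1/e ≈ 37% of the original records, which is what makes the OOB estimate possible.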
Now, RF creates S trees and uses m (= sqrt(M), or = floor(ln M + 1)) random features out of the M possible features when growing each tree (the m features are re-drawn at each split). This is called the random subspace method.
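In scikit-learn this is controlled by the `max_features` parameter; `"sqrt"` gives m = sqrt(M) candidate features per split. The dataset below is an illustrative assumption:

```python
# Sketch: the random-subspace step via max_features. With "sqrt", only
# sqrt(M) randomly chosen features are considered at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=25, random_state=0)
rf = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",   # m = sqrt(M) = 5 features considered per split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))
```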
So for each bootstrap dataset Ti you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM}, you let it pass through each tree, producing S outputs (one per tree), which can be denoted by Y = {y1, y2, ..., yS}. The final prediction is a majority vote on this set.
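The majority vote over Y is just a mode; a minimal sketch with illustrative per-tree outputs:

```python
# Sketch: majority vote over the S per-tree outputs Y = {y1, ..., yS}.
from collections import Counter

Y = [1, 0, 1, 1, 0]                     # illustrative per-tree predictions
final = Counter(Y).most_common(1)[0][0]  # most frequent class wins
print(final)  # -> 1
```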
Out-of-bag error:
After creating the classifiers (S trees), for each (Xi, yi) in the original training set T, select all the Tk that do not include (Xi, yi). Note that this subset is a set of bootstrap datasets that does not contain a particular record from the original dataset; this set is called the out-of-bag examples. There are n such subsets (one for each data record in the original dataset T). The OOB classifier is the aggregation of votes ONLY over those trees whose Tk does not contain (Xi, yi).

The out-of-bag estimate of the generalization error is the error rate of the out-of-bag classifier on the training set (comparing its predictions with the known yi's).
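The whole procedure above can be sketched end-to-end: for each record (Xi, yi), aggregate votes only from trees whose bootstrap sample did not include it, then compare the vote with yi. The dataset, S, and all names here are illustrative, and plain decision trees stand in for the forest's trees:

```python
# Sketch: computing the OOB error estimate by hand.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
n, S = len(X), 50
rng = np.random.default_rng(0)

trees, in_bag = [], []
for _ in range(S):
    idx = rng.integers(0, n, size=n)     # bootstrap indices, with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    in_bag.append(set(idx.tolist()))

errors, counted = 0, 0
for i in range(n):
    # votes only from trees whose bootstrap sample did NOT include record i
    votes = [int(t.predict(X[i:i + 1])[0])
             for t, bag in zip(trees, in_bag) if i not in bag]
    if votes:                            # record i was OOB for at least one tree
        counted += 1
        if Counter(votes).most_common(1)[0][0] != y[i]:
            errors += 1

print("OOB error estimate:", errors / counted)
```

With enough trees, nearly every record is out-of-bag for some of them, so the estimate covers essentially the whole training set.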
Why is it important?
The study of error estimates for bagged classifiers in Breiman [1996b] gives empirical evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
(Thanks @Rudolf for corrections. His comments below.)