What is out of bag error in Random Forests? Is it the optimal parameter for finding the right number of trees in a Random Forest?
The out-of-bag (OOB) error is the average error for each training sample, calculated using predictions only from the trees that did not contain that sample in their respective bootstrap sample. This allows the RandomForestClassifier to be fit and validated while being trained [1].
Most of the features have shown negligible importance: the mean is about 5%, a third of them have importance 0, and a third have importance above the mean. Perhaps the most striking fact, however, is the OOB (out-of-bag) score: a bit less than 1%.
The OOB estimate of error rate is a useful measure to discriminate between different random forest classifiers. We could, for instance, vary the number of trees or the number of variables to be considered, and select the combination that produces the smallest value for this error rate.
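For example, in scikit-learn the OOB estimate can be used to compare different numbers of trees directly. The sketch below assumes a synthetic dataset and an illustrative grid of tree counts; `oob_score_` is the attribute scikit-learn exposes when `oob_score=True` is set:

```python
# Sketch: comparing OOB error across different numbers of trees.
# The dataset and the tree-count grid are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for n_trees in (25, 50, 100, 200):
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,    # enable OOB estimation while fitting
        bootstrap=True,    # required for OOB (the default)
        random_state=0,
    )
    rf.fit(X, y)
    # oob_score_ is an accuracy, so the OOB error rate is 1 - oob_score_
    print(n_trees, 1.0 - rf.oob_score_)
```

You would then pick the combination of parameters with the smallest OOB error, with no separate validation split needed.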
Similarly, each OOB sample row is passed through every decision tree that did not contain it in its bootstrap training data, and a majority prediction is noted for each row. Finally, the OOB score is computed as the proportion of correctly predicted rows in the out-of-bag sample.
I will make an attempt to explain:
Suppose our training data set is represented by T and suppose data set has M features (or attributes or variables).
T = {(X1,y1), (X2,y2), ... (Xn, yn)}
and
Xi is an input vector {xi1, xi2, ... xiM} and yi is the label (or output, or class).
Summary of RF:
The Random Forests algorithm is a classifier based primarily on two methods: bagging and the random subspace method.
Suppose we decide to have S trees in our forest. We first create S datasets of the same size as the original, produced by random resampling of the data in T with replacement (n draws for each dataset). This yields datasets {T1, T2, ... TS}, each called a bootstrap dataset. Because of the "with replacement" sampling, every dataset Ti can contain duplicate data records, and Ti can be missing several records from the original dataset. This is called bootstrapping (en.wikipedia.org/wiki/Bootstrapping_(statistics)).
Bagging is the process of taking bootstrap samples and then aggregating the models learned on each of them.
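The resampling step can be sketched in a few lines. Names follow the text (T, S, n); the values are illustrative. Note how each bootstrap dataset has duplicates and, as a consequence, leaves some records out-of-bag:

```python
# Sketch: drawing S bootstrap datasets from T by sampling n rows with
# replacement. T is a stand-in for the training records.
import numpy as np

rng = np.random.default_rng(0)
n = 10
T = np.arange(n)             # record indices of the original dataset
S = 3                        # number of trees / bootstrap datasets

bootstraps = [rng.choice(T, size=n, replace=True) for _ in range(S)]
for i, Ti in enumerate(bootstraps, 1):
    # duplicates appear, and the missing records are out-of-bag for tree i
    oob = set(T) - set(Ti.tolist())
    print(f"T{i}: {sorted(Ti.tolist())}  OOB: {sorted(oob)}")
```

On average each bootstrap dataset leaves out about 1/e ≈ 37% of the original records, which is what makes the OOB estimate possible.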
Now, RF creates S trees and uses m (= sqrt(M), or = floor(ln M + 1)) random features out of the M possible features when growing each tree (the m features are re-drawn at each split). This is called the random subspace method.
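In scikit-learn this is controlled by the `max_features` parameter; `"sqrt"` gives m = sqrt(M) candidate features per split. The dataset below is an illustrative assumption:

```python
# Sketch: the random-subspace step via max_features. With "sqrt", only
# sqrt(M) randomly chosen features are considered at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=25, random_state=0)
rf = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",   # m = sqrt(M) = 5 features considered per split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))
```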
So for each bootstrap dataset Ti you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM}, you let it pass through each tree, producing S outputs (one per tree), which can be denoted by Y = {y1, y2, ..., yS}. The final prediction is a majority vote on this set.
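The majority vote over Y is just a mode; a minimal sketch with illustrative per-tree outputs:

```python
# Sketch: majority vote over the S per-tree outputs Y = {y1, ..., yS}.
from collections import Counter

Y = [1, 0, 1, 1, 0]                     # illustrative per-tree predictions
final = Counter(Y).most_common(1)[0][0]  # most frequent class wins
print(final)  # -> 1
```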
Out-of-bag error:
After creating the classifiers (S trees), for each (Xi, yi) in the original training set T, select all the Tk that do not include (Xi, yi). Note that this subset is a set of bootstrap datasets that does not contain a particular record from the original dataset; this set is called the out-of-bag examples. There are n such subsets (one for each data record in the original dataset T). The OOB classifier is the aggregation of votes ONLY over those trees whose Tk does not contain (Xi, yi).

The out-of-bag estimate of the generalization error is the error rate of the out-of-bag classifier on the training set (comparing its predictions with the known yi's).
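The whole procedure above can be sketched end-to-end: for each record (Xi, yi), aggregate votes only from trees whose bootstrap sample did not include it, then compare the vote with yi. The dataset, S, and all names here are illustrative, and plain decision trees stand in for the forest's trees:

```python
# Sketch: computing the OOB error estimate by hand.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
n, S = len(X), 50
rng = np.random.default_rng(0)

trees, in_bag = [], []
for _ in range(S):
    idx = rng.integers(0, n, size=n)     # bootstrap indices, with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    in_bag.append(set(idx.tolist()))

errors, counted = 0, 0
for i in range(n):
    # votes only from trees whose bootstrap sample did NOT include record i
    votes = [int(t.predict(X[i:i + 1])[0])
             for t, bag in zip(trees, in_bag) if i not in bag]
    if votes:                            # record i was OOB for at least one tree
        counted += 1
        if Counter(votes).most_common(1)[0][0] != y[i]:
            errors += 1

print("OOB error estimate:", errors / counted)
```

With enough trees, nearly every record is out-of-bag for some of them, so the estimate covers essentially the whole training set.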
Why is it important?
The study of error estimates for bagged classifiers in Breiman [1996b] gives empirical evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
(Thanks @Rudolf for corrections. His comments below.)