 

Scikit-learn Random Forest out of bag sample

I am trying to access the out-of-bag samples associated with each tree in a RandomForestClassifier, with no luck. I found other information, such as the Gini score and split feature for each node, by looking here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx

Does anyone know if it is possible to get the out-of-bag sample related to a tree? If not, is it possible to get the 'in bag' sample (the subset of the dataset used for a specific tree) and then compute the OOB sample from the original dataset?

Thanks in advance

asked Oct 22 '15 by wootwoot

People also ask

What is out of bag sample in random forest?

Out-of-bag (OOB) error, also called out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models utilizing bootstrap aggregating (bagging). Bagging uses subsampling with replacement to create training samples for the model to learn from.
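
As a rough illustration of the idea (a minimal NumPy sketch, not tied to any particular library's internals), a bootstrap sample is drawn with replacement and the out-of-bag sample is simply everything that was never drawn:

import numpy as np

rng = np.random.RandomState(0)
n_samples = 10

# Bootstrap sample: draw n_samples indices with replacement.
in_bag = rng.randint(0, n_samples, n_samples)

# Out-of-bag sample: every index that was never drawn
# (on average roughly 1/e, i.e. about 37% of the rows).
oob = np.setdiff1d(np.arange(n_samples), in_bag)

print("in bag:     ", np.sort(in_bag))
print("out of bag: ", oob)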

How do I find out my Oob score?

Similarly, each of the OOB sample rows is passed through every decision tree that did not contain that row in its bootstrap training data, and a majority prediction is noted for each row. Lastly, the OOB score is computed as the fraction of correctly predicted rows in the out-of-bag sample.
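
For intuition, here is a toy sketch of that computation (the per-tree predictions are made-up numbers, just to show the majority vote and the final score; ties are broken towards class 0 for simplicity):

import numpy as np

# Toy setup: 3 trees, 5 samples, binary labels.
y = np.array([0, 1, 1, 0, 1])

# per_tree_pred[t, i] is tree t's prediction for sample i; np.nan means
# sample i was in tree t's bootstrap sample, so it gets no OOB vote.
per_tree_pred = np.array([
    [0,      np.nan, 1,      0,      np.nan],
    [np.nan, 1,      1,      np.nan, 0     ],
    [0,      1,      np.nan, 1,      1     ],
])

# Majority vote over the trees for which each sample was out of bag.
votes_for_1 = np.nansum(per_tree_pred, axis=0)
n_votes = np.sum(~np.isnan(per_tree_pred), axis=0)
majority = (votes_for_1 > n_votes / 2).astype(int)

# OOB score = fraction of rows predicted correctly by their OOB majority vote.
oob_score = np.mean(majority == y)
print(oob_score)  # 0.8 for this toy data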

Which sampling is used in random forest?

The random forest algorithm is made up of a collection of decision trees; each tree in the ensemble is built from a data sample drawn from the training set with replacement, called the bootstrap sample.

What is a good Oob score?

There's no such thing as a good oob_score on its own; it's the difference between the validation score and the oob_score that matters. Think of oob_score as a score on a subset (the OOB set) of the training set, built as described above.
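
In practice you can get both numbers directly from scikit-learn and compare them; a small sketch (the dataset here is synthetic, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

# The two scores should be in the same ballpark; the gap between them is
# what you actually want to look at, not the absolute oob_score_ value.
print("OOB score:       ", rf.oob_score_)
print("Validation score:", rf.score(X_valid, y_valid))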


1 Answer

You can figure this out yourself from the source code: look at how the private _set_oob_score method of the random forest works. Every tree estimator in scikit-learn has its own seed for the pseudo-random number generator, stored in the estimator.random_state field.

During the fit procedure, every estimator learns on a subset of the training set, and the indices of that subset are generated with a PRNG seeded from estimator.random_state.

This should work:

from sklearn.ensemble.forest import _generate_unsampled_indices
# X here - training set of examples
n_samples = X.shape[0]
for tree in rf.estimators_:
    # At each iteration we obtain the out-of-bag sample for one tree.
    unsampled_indices = _generate_unsampled_indices(
        tree.random_state, n_samples)
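
If you also want the OOB estimate the question asks about, here is a sketch that builds on the snippet above. It assumes X and y are NumPy arrays holding the training set and the same two-argument _generate_unsampled_indices as at the time of this answer; newer scikit-learn releases moved the helper to sklearn.ensemble._forest and added an n_samples_bootstrap argument, so adjust the import and call accordingly:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble.forest import _generate_unsampled_indices

# X, y - the training set, as above
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
n_samples = X.shape[0]

# Accumulate class-probability votes for each sample, using only the trees
# that did NOT see that sample during fitting.
votes = np.zeros((n_samples, rf.n_classes_))
for tree in rf.estimators_:
    oob = _generate_unsampled_indices(tree.random_state, n_samples)
    votes[oob] += tree.predict_proba(X[oob])

# A few samples may have been in-bag for every tree; skip them.
has_votes = votes.sum(axis=1) > 0
oob_pred = rf.classes_[votes.argmax(axis=1)]

# This is essentially what rf.oob_score_ reports when you fit with oob_score=True.
print(np.mean(oob_pred[has_votes] == y[has_votes]))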
answered Nov 15 '22 by Ibraim Ganiev