How can the scikit-learn Random Forest sub-sample size be equal to the original training data size?

Question

In the documentation of the scikit-learn Random Forest classifier, it is stated that:

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

What I don't understand is this: if the sample size is always the same as the input sample size, then how can we talk about a random selection? There is no selection here, because we use all of the (and naturally the same) samples at each training.

Am I missing something here?

2 revs, 2 users 92% · Accepted Answer

I believe this part of the docs answers your question:

In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

The key to understanding this is the phrase "sample drawn with replacement". It means that each instance can be drawn more than once. This in turn means that, in a given bootstrap sample, some instances of the training set are present several times and some are not present at all (out-of-bag). Which instances those are differs from tree to tree.
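To make the effect of drawing with replacement concrete, here is a minimal sketch using only Python's standard library. It is not scikit-learn's internal code; the helper name `bootstrap_indices` and the toy size of 1000 rows are assumptions chosen for illustration.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices from range(n_samples) with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

n_samples = 1000          # hypothetical training set size
rng = random.Random(0)    # fixed seed so the run is repeatable

# Two "trees" each get their own bootstrap sample of the full size.
tree_a = bootstrap_indices(n_samples, rng)
tree_b = bootstrap_indices(n_samples, rng)

in_bag_a = set(tree_a)
in_bag_b = set(tree_b)
oob_a = set(range(n_samples)) - in_bag_a  # rows never drawn for tree A

print(len(tree_a))                 # same size as the training set
print(len(in_bag_a) / n_samples)   # fraction of distinct rows, roughly 0.63
print(len(oob_a) / n_samples)      # out-of-bag fraction, roughly 0.37
print(in_bag_a == in_bag_b)        # whether both trees saw the same rows
```

So the sample has the same length as the training set, yet only about two thirds of the distinct rows end up in it, and a different two thirds for each tree.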

Soren Havelund Welling · Answer

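Certainly not all samples are selected for each tree. By default, each sample has a 1-((N-1)/N)^N ≈ 0.63 chance of being drawn at least once for one particular tree, where N is the sample size of the training set. For large N, the number of times a given sample is drawn is approximately Poisson-distributed with mean 1, so a sample is left out with probability about 0.37, drawn exactly once with probability about 0.37, exactly twice with about 0.18, exactly three times with about 0.06, and so on.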
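The 0.63 figure can be checked numerically. The sketch below assumes each of the N draws picks a row uniformly at random, so the number of times a given row is drawn follows a Binomial(N, 1/N) distribution; the function names `p_in_bag` and `p_exactly_k` are made up for this example.

```python
import math

def p_in_bag(n):
    """Probability that a given row is drawn at least once in n draws."""
    return 1.0 - ((n - 1) / n) ** n

def p_exactly_k(n, k):
    """Binomial probability that a given row is drawn exactly k times."""
    p = 1.0 / n
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

for n in (10, 100, 10_000):
    # Columns: N, P(at least once), [P(0), P(1), P(2), P(3)]
    print(n, round(p_in_bag(n), 4),
          [round(p_exactly_k(n, k), 4) for k in range(4)])

# Large-N limit of p_in_bag: 1 - 1/e
print(round(1.0 - math.exp(-1.0), 4))
```

As N grows, the probability of being drawn at least once settles near 1 - 1/e, and the probabilities of being drawn exactly k times approach the Poisson(1) values e^-1 / k!.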

Each bootstrap sample is, on average, different enough from the other bootstrap samples that the decision trees are adequately different, so the average prediction of the trees is robust to the variance of each individual tree model. If the sample size could be increased to 5 times the training set size, every observation would probably be present 3-7 times in each tree, and the overall ensemble prediction performance would suffer.
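The contrast between a same-size bootstrap and a hypothetical 5x oversample can be simulated directly. This is only an illustration with the standard library, under the assumption of uniform draws with replacement; `draw_counts`, `share_in_range`, and the 2000-row size are invented for the sketch and do not correspond to any scikit-learn option.

```python
import random
from collections import Counter

def draw_counts(n_rows, n_draws, rng):
    """Count how often each row index appears in n_draws draws with replacement."""
    counts = Counter(rng.randrange(n_rows) for _ in range(n_draws))
    return [counts.get(i, 0) for i in range(n_rows)]

def share_in_range(counts, low, high):
    """Fraction of rows whose draw count lies in [low, high]."""
    return sum(low <= c <= high for c in counts) / len(counts)

n_rows = 2000            # hypothetical training set size
rng = random.Random(42)  # fixed seed so the run is repeatable

same_size = draw_counts(n_rows, n_rows, rng)       # default-style bootstrap
five_times = draw_counts(n_rows, 5 * n_rows, rng)  # hypothetical 5x oversample

print(same_size.count(0) / n_rows)        # rows left out, roughly 0.37
print(five_times.count(0) / n_rows)       # rows left out, under 0.01 in expectation
print(share_in_range(five_times, 3, 7))   # rows present 3 to 7 times
```

With the 5x sample almost every row appears in every tree, so the trees would be trained on nearly the same data and lose the diversity that the averaging relies on.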

Tags: python, scikit-learn, random-forest, subsampling

TAK
