Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can SciKit-Learn Random Forest sub sample size may be equal to original training data size?

In the documentation of SciKit-Learn Random Forest classifier , it is stated that

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

What I dont understand is that if the sample size is always the same as the input sample size than how can we talk about a random selection. There is no selection here because we use all the (and naturally the same) samples at each training.

Am I missing something here?

like image 919
TAK Avatar asked Mar 06 '16 13:03

TAK


2 Answers

I believe this part of docs answers your question

In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

The key to understanding is in "sample drawn with replacement". This means that each instance can be drawn more than once. This in turn means, that some instances in the train set are present several times and some are not present at all (out-of-bag). Those are different for different trees

like image 158
2 revs, 2 users 92% Avatar answered Nov 14 '22 22:11

2 revs, 2 users 92%


Certainly not all samples are selected for each tree. Be default each sample has a 1-((N-1)/N)^N~0.63 chance of being sampled for one particular tree and 0.63^2 for being sampled twice, and 0.63^3 for being sampled 3 times... where N is the sample size of the training set.

Each bootstrap sample selection is in average enough different from other bootstraps, such that decision trees are adequately different, such that the average prediction of trees is robust toward the variance of each tree model. If sample size could be increased to 5 times more than training set size, every observation would probably be present 3-7 times in each tree and the overall ensemble prediction performance would suffer.

like image 38
Soren Havelund Welling Avatar answered Nov 15 '22 00:11

Soren Havelund Welling