I am working on fitting a RandomForestClassifier and came across two parameters: min_samples_split and min_samples_leaf.
Do I need to set both min_samples_split and min_samples_leaf?
I think I just need one of them since one is effectively half of the other. Am I correct in my understanding?
No, one is not simply half of the other; they constrain different things. min_samples_split is the minimum number of samples a node must contain before it is allowed to be split. For instance, if min_samples_split = 6 and a node holds only 4 samples, that node will not be split (regardless of its entropy).
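You can see this behaviour with a single decision tree (the same parameter works identically inside a random forest). Here is a minimal sketch with made-up toy data, assuming a recent scikit-learn:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: only 4 samples in total (hypothetical, for illustration)
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# min_samples_split=6 > 4 samples at the root, so no split can happen
tree = DecisionTreeClassifier(min_samples_split=6, random_state=0).fit(X, y)

print(tree.get_depth())     # 0: the root was never split
print(tree.get_n_leaves())  # 1: the root itself is the only leaf
```

Dropping min_samples_split back to its default of 2 would let the tree split normally on this data.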
min_samples_leaf, on the other hand, is the minimum number of samples required in each leaf node. For example, a node containing 5 samples could be split into two leaves of size 2 and 3. If min_samples_leaf = 3, that split will not occur, because the minimum leaf size is 3 and you cannot create a leaf with only 2 samples.
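The same 5-sample scenario can be sketched in code. Any split of 5 samples leaves one side with at most 2 samples, so min_samples_leaf=3 blocks every candidate split (toy data again assumed for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# 5 samples: every possible split yields a child with fewer than 3 samples
X = [[0], [1], [2], [3], [4]]
y = [0, 0, 0, 1, 1]

# min_samples_leaf=3 forbids any leaf smaller than 3, so no split occurs
tree = DecisionTreeClassifier(min_samples_leaf=3, random_state=0).fit(X, y)

print(tree.get_n_leaves())  # 1: the root stays a single leaf
```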
You can take a look at this and this for further reading.
Update: the difference in behaviour between RandomForestClassifier and GradientBoostingClassifier is largely due to how they train (gradient boosting is an ensemble of sequentially fitted classifiers); you can read more about it here to understand the internal workings of gradient boosting.