Difference between min_samples_split and min_samples_leaf in sklearn DecisionTreeClassifier

Tags:

scikit-learn

I was going through the sklearn class DecisionTreeClassifier.

Looking at the parameters for the class, there are two that seem closely related: min_samples_split and min_samples_leaf. The basic idea behind them looks similar: you specify a minimum number of samples, which is used to decide whether a node becomes a leaf or is split further.

Why do we need two parameters when one seems to imply the other? Is there any reason or scenario that distinguishes them?

311

asked Sep 29 '17 01:09

Hara Chaitanya

2 Answers

Both parameters limit how far the tree grows, but they check different things: one looks at the node about to be split, the other at the children a split would produce.

The min_samples_split parameter looks at the number of samples in the current node. If that number is less than the minimum, the node is not split and becomes a leaf.

The min_samples_leaf parameter is checked before the children are created. If a candidate split would produce a child with fewer samples than the minimum, that split is rejected (the child would be too small to be a valid leaf). If no candidate split satisfies the constraint, the node becomes a leaf.

In all cases, when a leaf contains training samples from more than one class, the predicted class is the most frequent class among the training samples that reached that leaf.
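The two checks described above can be compared empirically. The following is a minimal sketch, assuming a synthetic dataset from make_classification and a threshold of 10 chosen purely for illustration; the helper leaf_sizes is hypothetical and simply reads the fitted tree's public tree_ attributes.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)


def leaf_sizes(clf):
    """Return the number of training samples in each leaf of a fitted tree."""
    tree = clf.tree_
    is_leaf = tree.children_left == -1  # leaves have no children
    return tree.n_node_samples[is_leaf]


# Same threshold, applied through each parameter in turn.
by_split = DecisionTreeClassifier(min_samples_split=10, random_state=0).fit(X, y)
by_leaf = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

print("min_samples_split=10 -> smallest leaf:", leaf_sizes(by_split).min())
print("min_samples_leaf=10  -> smallest leaf:", leaf_sizes(by_leaf).min())
```

With min_samples_leaf=10 every leaf is guaranteed to hold at least 10 training samples. With min_samples_split=10 only the nodes that were split are guaranteed to hold at least 10, so individual leaves may end up smaller, depending on the data.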

answered Oct 14 '22 02:10

Marcello Novaes

From the documentation:

The main difference between the two is that min_samples_leaf guarantees a minimum number of samples in a leaf, while min_samples_split can create arbitrarily small leaves, though min_samples_split is more common in the literature.

To get a grasp of this piece of documentation, I think you should make the distinction between a leaf (also called an external node) and an internal node. An internal node is split further and therefore has children, while a leaf is by definition a node without any children (without any further splits).

min_samples_split specifies the minimum number of samples required to split an internal node, while min_samples_leaf specifies the minimum number of samples required to be at a leaf node.

For instance, if min_samples_split = 5 and there are 7 samples at an internal node, then a split is allowed. But let's say a candidate split would produce two leaves, one with 1 sample and another with 6 samples. If min_samples_leaf = 2, then that split won't be allowed (even though the internal node has 7 samples), because one of the resulting leaves would have fewer than the minimum number of samples required to be at a leaf node.
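The worked example above can be written out as a small check. This is an illustrative sketch only, not scikit-learn's implementation: split_allowed is a hypothetical helper that applies the two rules to a single candidate split, given the sample counts of the two would-be children.

```python
def split_allowed(n_left, n_right, min_samples_split=2, min_samples_leaf=1):
    """Decide whether one candidate split passes both constraints.

    n_left and n_right are the sample counts of the two would-be children.
    """
    n_node = n_left + n_right

    # Rule 1: the node itself must hold enough samples to be split at all.
    if n_node < min_samples_split:
        return False

    # Rule 2: each child must hold enough samples to be a valid leaf.
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return False

    return True


# 7 samples at the node, candidate split into 1 and 6 samples.
print(split_allowed(1, 6, min_samples_split=5, min_samples_leaf=2))  # False
# Same node, but a candidate split into 3 and 4 samples.
print(split_allowed(3, 4, min_samples_split=5, min_samples_leaf=2))  # True
# Only 4 samples at the node: too few to split, whatever the children.
print(split_allowed(2, 2, min_samples_split=5, min_samples_leaf=2))  # False
```

The first call fails only on the leaf rule, and the third fails only on the split rule, which shows that neither parameter implies the other.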

As the documentation referenced above mentions, min_samples_leaf guarantees a minimum number of samples in every leaf, no matter the value of min_samples_split.

answered Oct 14 '22 03:10

Alex
