
What does "splitter" attribute in sklearn's DecisionTreeClassifier do?

The sklearn DecisionTreeClassifier has an attribute called "splitter", which is set to "best" by default. What does setting it to "best" or "random" do? I couldn't find enough information in the official documentation.

Vijayabhaskar J asked Oct 15 '17 15:10




2 Answers

Short answer:

RandomSplitter evaluates **one randomly drawn split on each candidate feature**, whereas BestSplitter evaluates **all possible splits on each candidate feature**.
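To see the difference in practice, here is a minimal sketch that fits the same data with both settings. The synthetic dataset and the variable names are assumptions made for illustration; they are not part of the question.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not from the question).
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Same data, same seed; only the splitter strategy differs.
best_tree = DecisionTreeClassifier(splitter="best", random_state=0).fit(X, y)
random_tree = DecisionTreeClassifier(splitter="random", random_state=0).fit(X, y)

for name, tree in [("best", best_tree), ("random", random_tree)]:
    print(name,
          "depth:", tree.get_depth(),
          "nodes:", tree.tree_.node_count,
          "train accuracy:", tree.score(X, y))
```

Both trees are grown without a depth limit, so both can fit the training data. The "random" tree tends to need more nodes to get there, because its thresholds are not optimized.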


Longer explanation:

This is clear when you go through _splitter.pyx.

  • RandomSplitter computes the improvement only for a single threshold that is drawn at random (ref. lines 761 and 801). BestSplitter goes through all possible splits in a while loop (ref. line 436, where the loop starts, and line 462). [Note: line numbers refer to version 0.21.2.]
  • Contrary to the earlier answers from 15 Oct 2017 and 1 Feb 2018, RandomSplitter and BestSplitter both loop through all relevant features. This is also evident in _splitter.pyx.
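
The two points above can be sketched in plain NumPy. This is a simplified illustration, not sklearn's actual Cython code: the function names (`choose_split`, `candidate_thresholds_best`, `candidate_thresholds_random`) are hypothetical, Gini impurity is assumed as the criterion, and details such as `max_features` and sample weights are left out.

```python
import numpy as np


def gini(y):
    """Gini impurity of a label array."""
    if y.size == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / y.size
    return 1.0 - np.sum(p ** 2)


def weighted_impurity(x, y, threshold):
    """Impurity after splitting on x <= threshold, weighted by child size."""
    left = x <= threshold
    n_left = left.sum()
    n_right = y.size - n_left
    return (n_left * gini(y[left]) + n_right * gini(y[~left])) / y.size


def candidate_thresholds_best(x):
    """'best'-style: every midpoint between consecutive distinct values."""
    values = np.unique(x)
    return (values[:-1] + values[1:]) / 2.0


def candidate_thresholds_random(x, rng):
    """'random'-style: a single threshold drawn uniformly in [min, max)."""
    return np.array([rng.uniform(x.min(), x.max())])


def choose_split(X, y, strategy, rng=None):
    """Loop over ALL features in both strategies; keep the lowest impurity."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        x = X[:, j]
        if x.min() == x.max():
            continue  # constant feature, cannot be split
        if strategy == "best":
            thresholds = candidate_thresholds_best(x)
        else:
            thresholds = candidate_thresholds_random(x, rng)
        for t in thresholds:
            score = weighted_impurity(x, y, t)
            if score < best[2]:
                best = (j, t, score)
    return best  # (feature index, threshold, weighted impurity)


rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = (X[:, 1] > 0.5).astype(int)  # only feature 1 is informative

best_split = choose_split(X, y, "best")
random_split = choose_split(X, y, "random", rng)
print("best:  ", best_split)
print("random:", random_split)
```

The only difference between the two strategies is the set of thresholds tried per feature; the outer loop over features is identical.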
JSong answered Sep 28 '22 12:09


In fact, the "random" option is used to implement extremely randomized trees (extra-trees) in sklearn. In a nutshell, with this option the splitting algorithm still traverses all features but, for each one, picks the split point at random between the feature's minimum and maximum values. If you are interested in the algorithm's details, you can refer to this paper [1]. Moreover, if you are interested in the detailed implementation of this algorithm, you can refer to this page.

[1] P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees", Machine Learning, 63(1), 3-42, 2006.
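
The link between splitter="random" and extra-trees can be checked from the estimators' default parameters. This is a small sketch; the synthetic data is an assumption for illustration only.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

# Default splitter of a regular tree vs. a single extra-tree.
default_dt = DecisionTreeClassifier().get_params()["splitter"]
default_et = ExtraTreeClassifier().get_params()["splitter"]
print("DecisionTreeClassifier:", default_dt)
print("ExtraTreeClassifier:   ", default_et)

# The trees inside an ExtraTreesClassifier ensemble use the same setting.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)  # synthetic data, assumed for illustration
y = (X[:, 0] > 0.5).astype(int)
forest = ExtraTreesClassifier(n_estimators=5, random_state=0).fit(X, y)
member_splitters = {est.splitter for est in forest.estimators_}
print("ensemble members:", member_splitters)
```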

zhenlingcn answered Sep 28 '22 12:09