If I understand this correctly, we are given a set of objects (each an array of features) and need to split it into two subsets. To do that, we compare some feature xj to a threshold tm (the threshold at node m), and we use an impurity function H() to find the best split. But how do we choose the value of tm, and which feature should be compared to it? There are infinitely many possible values of tm, so we can't just compute H() for each one.
On page 18 of these slides, two methods are introduced for choosing the splitting threshold for a numerical attribute X.
Method 1:
Method 2:
Suppose X is a real-valued variable
Define IG(Y|X:t) as H(Y) - H(Y|X:t)
Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
Then define IG*(Y|X) = max_t IG(Y|X:t)
For each real-valued attribute, use IG*(Y|X) for assessing its suitability as a split
Note: the tree may split on the same attribute multiple times, with different thresholds
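On the "infinite number of thresholds" worry: IG(Y|X:t) only changes when t crosses one of the observed values of X, so it is enough to try one candidate between each pair of adjacent distinct values. Here is a minimal Python sketch of Method 2 under that assumption. The names `entropy` and `best_threshold` are made up for illustration, using midpoints as candidates is a common convention rather than something the slides prescribe, and H is taken to be Shannon entropy in bits.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(xs, ys):
    """Return (t, gain) maximising IG(Y|X:t) over the candidate thresholds.

    Candidates are the midpoints between adjacent distinct values of X
    (an assumption of this sketch). Returns (None, 0.0) if no candidate
    gives a positive gain.
    """
    base = entropy(ys)  # H(Y)
    values = sorted(set(xs))
    best_t, best_gain = None, 0.0
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        # H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gain = base - cond
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

For example, `best_threshold([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"])` returns `(2.5, 1.0)`. To pick the feature as well, you would run this once per real-valued attribute and split on the one with the largest gain. This version rescans the data for every candidate; sorting once and updating class counts incrementally is the usual way to make it faster.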