
How does a decision tree calculate the splitting attribute?

I am using a decision tree algorithm on a data set that consists of numerical values.

I have found that the results produced by the program split nodes on values that do not even exist in the data set.

Example:
Classification results:

  1. attrib2 <= 3.761791861252009 : groupA
  2. attrib2 > 3.761791861252009 : groupB

whereas in my dataset there is no value for attrib2 like 3.76179. Why is that?

Zia asked Jun 09 '11

3 Answers

Most decision tree building algorithms (J48, C4.5, CART, ID3) work as follows:

  • Sort the values of each attribute you can split on.
  • Find all the "breakpoints" where the associated class labels change.
  • Among those candidate split points, pick the one that minimizes the impurity measure (equivalently, maximizes the information gain). Note that the information gain depends only on the ordering of the values, not on the values themselves.

Once you've found the best split point, algorithms disagree on how to represent it. Example: say you have -4 (Yes), -3 (Yes), -3 (Yes), -2 (No), -1 (No). Any value between -3 and -2 will give the same purity. Some algorithms (C4.5) will say val <= -3. Others, e.g. Weka, will choose the average of the two and give val <= -2.5, as the sketch below illustrates.
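To make the breakpoint search concrete, here is a minimal sketch of the procedure described above, assuming a tiny one-dimensional labelled data set; the helper names (entropy, best_split) are illustrative and not taken from any particular library.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def best_split(values, labels):
        """Scan the breakpoints where the class label changes and return the
        threshold (midpoint of the gap, Weka-style) with the lowest weighted
        child entropy."""
        pairs = sorted(zip(values, labels))
        best_thr, best_impurity = None, float("inf")
        for (v_prev, y_prev), (v_cur, y_cur) in zip(pairs, pairs[1:]):
            if v_prev == v_cur or y_prev == y_cur:
                continue  # only breakpoints where the label changes matter
            thr = (v_prev + v_cur) / 2
            left = [y for v, y in pairs if v <= thr]
            right = [y for v, y in pairs if v > thr]
            impurity = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            if impurity < best_impurity:
                best_thr, best_impurity = thr, impurity
        return best_thr

    # The example from this answer: -4, -3, -3 are Yes and -2, -1 are No,
    # so the chosen threshold is -2.5, a value that never occurs in the data.
    print(best_split([-4, -3, -3, -2, -1], ["Yes", "Yes", "Yes", "No", "No"]))

Running it prints -2.5, matching the Weka-style midpoint behaviour described above; a C4.5-style implementation would instead report the boundary value -3.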

Waleed K answered Oct 10 '22


There are several ways to choose a splitting value, and not all of them pick values that appear in the data set.

A common one (though a bit simplistic) is to take the mean. It is possible that 3.76179... is the mean of all the attrib2 values in your data set.

For example, if your data set is 1-dimensional and made up of the values -10, -9, ..., -2, -1, 1, 2, ..., 9, 10, then a good splitting value would be 0, even though it is not in your data set.

Another possibility, especially if you're dealing with random forests (ensembles of decision trees), is that the splitting value is chosen at random, with a probability distribution centered around the median value. Some algorithms split according to a Gaussian centered on the mean/median and with deviation equal to the standard deviation of the data set, as in the sketch below.
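As a toy illustration of the mean-based split and the randomised variant, here is a small sketch using the symmetric example above; it is a hedged example only and not tied to any particular library's behaviour.

    import random
    import statistics

    # The symmetric 1-D example from above: -10..-1 and 1..10, with no 0.
    values = [v for v in range(-10, 11) if v != 0]

    # Mean-based split: the threshold is 0, even though 0 is not in the data.
    threshold = statistics.mean(values)
    print(threshold, threshold in values)          # 0 False
    left = [v for v in values if v <= threshold]   # the negatives
    right = [v for v in values if v > threshold]   # the positives

    # Randomised variant: draw the split from a Gaussian centred on the mean
    # with the data set's standard deviation, as some ensemble methods do.
    random_threshold = random.gauss(statistics.mean(values),
                                    statistics.stdev(values))
    print(random_threshold)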

B. Decoster answered Oct 10 '22


First, you can look at how numeric values are discretised. These algorithms split the numeric value range into several intervals, each of which gives a high information gain. For example, you move a candidate cut point along the range with a step of 0.1; after each move you check the information gain and select the best position, then continue recursively within the resulting intervals. A rough sketch follows.
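To illustrate that scan, here is a rough sketch that slides a candidate cut point across the attribute's range in steps of 0.1 and keeps the cut with the highest information gain; the data and helper names are made up for the example.

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(values, labels, cut):
        """Information gain of splitting the data at `cut` (<= goes left)."""
        left = [y for v, y in zip(values, labels) if v <= cut]
        right = [y for v, y in zip(values, labels) if v > cut]
        if not left or not right:
            return 0.0
        n = len(labels)
        return (entropy(labels)
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))

    def best_cut(values, labels, step=0.1):
        """Scan candidate cuts across the value range in fixed steps and
        return the one with the highest information gain."""
        cut, hi = min(values) + step, max(values)
        candidates = []
        while cut < hi:
            candidates.append(cut)
            cut += step
        return max(candidates, key=lambda c: info_gain(values, labels, c))

    values = [1.2, 1.9, 2.5, 3.4, 4.1, 4.8]
    labels = ["groupA", "groupA", "groupA", "groupB", "groupB", "groupB"]
    # Prints a cut somewhere in the gap between 2.5 and 3.4, which is where
    # the class changes, even though no such value occurs in the data.
    print(best_cut(values, labels))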

yura answered Oct 10 '22