 

Decision tree using continuous variable [closed]

I have a question about decision trees with continuous variables.

I've heard that when the output variable is continuous and the input variable is categorical, the split criterion is variance reduction or something similar. But I don't know how it works when the input variable is continuous:

  1. input variable : continuous / output variable : categorical

  2. input variable : continuous / output variable : continuous

For these two cases, how do we get a split criterion like the Gini index or information gain?

When I use rpart in R, it works well whatever the input and output variables are, but I don't know the algorithm in detail.

Asked Nov 30 '16 by BSKim



2 Answers

1) Input variable: continuous / output variable: categorical

The C4.5 algorithm handles this case.

In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those whose value is less than or equal to it.
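As a rough illustration of that threshold search (this is a minimal sketch, not C4.5's actual implementation; the function names and toy data are made up for the example), you can scan the midpoints between adjacent sorted values and keep the threshold with the highest information gain:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Scan candidate thresholds (midpoints between adjacent sorted
    values) and return the one with the highest information gain."""
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    base = entropy(ys)
    best_gain, best_t = 0.0, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no boundary between equal values
        t = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        gain = base - (len(left) / len(ys)) * entropy(left) \
                    - (len(right) / len(ys)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```

For instance, `best_threshold([1, 2, 3, 10, 11, 12], ['a', 'a', 'a', 'b', 'b', 'b'])` returns the threshold 6.5 with gain 1.0, since splitting there separates the classes perfectly.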

2) Input variable: continuous / output variable: continuous

The CART (classification and regression trees) algorithm handles this case.

Case 2 is the regression problem. You enumerate each attribute j and each candidate split value s of that attribute, and split the list into the points whose value of attribute j is above s and those whose value is less than or equal to it. This gives two regions:

    R1(j, s) = { x | x_j <= s }        R2(j, s) = { x | x_j > s }

Find the best attribute j and the best split value s, which solve

    min over (j, s) of [ min over c1 of  sum over x_i in R1(j,s) of (y_i - c1)^2
                       + min over c2 of  sum over x_i in R2(j,s) of (y_i - c2)^2 ]

c1 and c2 can be solved in closed form: each is simply the mean of the outputs in its region,

    c1 = ave(y_i | x_i in R1(j, s))    c2 = ave(y_i | x_i in R2(j, s))

Then, when doing regression, the fitted tree predicts

    f(x) = sum over m = 1..M of  c_m * I(x in R_m)

where

    c_m = ave(y_i | x_i in R_m)

over the M leaf regions R_1, ..., R_M.
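A brute-force sketch of that (j, s) search in Python (illustrative only; real CART implementations sort each feature once and update the sums incrementally rather than recomputing them per candidate split):

```python
def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    c = sum(ys) / len(ys)
    return sum((y - c) ** 2 for y in ys)

def best_split(X, y):
    """Enumerate every attribute j and candidate split value s; return
    (error, j, s, c1, c2) minimizing the total squared error of the two
    regions, where c1 and c2 are the region means."""
    n_features = len(X[0])
    best = None
    for j in range(n_features):
        values = sorted(set(row[j] for row in X))
        for a, b in zip(values, values[1:]):
            s = (a + b) / 2  # candidate split between adjacent values
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                c1 = sum(left) / len(left)
                c2 = sum(right) / len(right)
                best = (err, j, s, c1, c2)
    return best
```

On a toy dataset like `X = [[1], [2], [10], [11]]`, `y = [1.0, 1.0, 5.0, 5.0]`, the best split is s = 6.0 on attribute 0, with region means c1 = 1.0 and c2 = 5.0 and zero remaining error.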

Answered Sep 21 '22 by Vito


I can explain the concept at a very high level.

The main goal of the algorithm is to find an attribute to use for the first split. We can use various impurity metrics to evaluate the most significant attribute: information gain, entropy, gain ratio, etc. But if the decision variable is a continuous variable, we usually use a different metric, standard deviation reduction. Whatever metric you use, depending on your algorithm (i.e. ID3, C4.5, etc.) you end up finding an attribute that will be used for splitting.

When you have a continuous attribute, things get a little tricky. You need to find the threshold value for that attribute that gives you the highest impurity reduction (entropy, gain ratio, information gain... whatever you are using). Then you find which attribute's best threshold gives the highest reduction overall, and choose that attribute accordingly.

Now, if the attribute is continuous and the decision variable is also continuous, you can simply combine the above two concepts and generate a regression tree.

That means, since the decision variable is continuous, you use a metric like variance reduction and choose the attribute (with its threshold value) that gives you the highest value of that metric across the threshold values of all attributes.
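For example, the "standard deviation reduction" mentioned above can be computed like this (a hypothetical helper written for illustration, using Python's statistics module):

```python
import statistics

def std_reduction(parent_y, children):
    """Standard deviation reduction for a candidate split: the SD of the
    parent node's targets minus the size-weighted SDs of the children."""
    n = len(parent_y)
    weighted = sum(len(c) / n * statistics.pstdev(c) for c in children)
    return statistics.pstdev(parent_y) - weighted
```

The split (attribute plus threshold) whose child nodes give the largest reduction is the one chosen; e.g. `std_reduction([1, 1, 1, 9, 9, 9], [[1, 1, 1], [9, 9, 9]])` is 4.0, since each child is perfectly homogeneous.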

You can visualize such a regression tree using decision tree machine learning software like SpiceLogic Decision Tree Software. Say you have a data table like this:

[image: example data table]

The software will generate a regression tree like this:

[image: generated regression tree]

Answered Sep 20 '22 by Emran Hussain