
Calculating entropy in decision tree (Machine learning)

I know the formula for calculating entropy:

H(Y) = - ∑ (p(yj) * log2(p(yj)))

In words: select an attribute and, for each of its values, check the target attribute value ... so p(yj) is the fraction of patterns at node N that fall into category yj: one fraction for the true target value and one for the false one.
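As an illustration (not part of the original question), here is a minimal sketch of that formula in Python, assuming the target values at a node are given as a plain list of discrete labels:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum(p(yj) * log2(p(yj))) over the distinct values yj in `labels`."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

# Example: a boolean target with 9 "true" and 5 "false" patterns at node N
print(entropy(["true"] * 9 + ["false"] * 5))  # ~0.940 bits
```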

But I have a dataset in which the target attribute is price, i.e. a continuous range of values. How do I calculate entropy for this kind of dataset?

(Reference: http://decisiontrees.net/decision-trees-tutorial/tutorial-5-exercise-2/)

asked Jan 16 '13 by code muncher

People also ask

How do you calculate entropy in a decision tree?

Entropy is a measure of disorder or uncertainty, and the goal of machine learning models and data scientists in general is to reduce uncertainty. We simply subtract the entropy of Y given X from the entropy of Y alone to calculate the reduction in uncertainty about Y given an additional piece of information X about Y.
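As a hedged sketch (not from the original page), that "reduction in uncertainty" is the information gain IG(Y, X) = H(Y) - H(Y|X); the split used below is a made-up example:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) over the distinct values in `labels`."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(partitions):
    """IG(Y, X) = H(Y) - H(Y|X), where attribute X splits Y into `partitions`."""
    all_labels = [y for part in partitions for y in part]
    n = len(all_labels)
    h_y_given_x = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(all_labels) - h_y_given_x

# Hypothetical split: attribute X divides 14 patterns into two groups
left = ["true"] * 6 + ["false"] * 1
right = ["true"] * 3 + ["false"] * 4
print(information_gain([left, right]))  # ~0.15 bits of uncertainty removed
```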

How do you calculate entropy in machine learning?

For example, in a binary classification problem (two classes), we can calculate the entropy of the data sample as follows: Entropy = -(p(0) * log2(p(0)) + p(1) * log2(p(1)))

What is entropy in decision tree in machine learning?

Entropy is the measurement of disorder or impurity in the information processed in machine learning. It determines how a decision tree chooses to split data. We can understand entropy with a simple example: flipping a coin, which has two possible outcomes.
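As a worked instance of that coin example (a standard result, not from the original page): for a fair coin, H = -(0.5 * log2(0.5) + 0.5 * log2(0.5)) = 1 bit, the maximum possible for two outcomes, whereas a coin that always lands on the same side has H = 0.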


1 Answer

You first need to discretise the data set in some way, e.g. by sorting it numerically into a number of buckets. Many discretisation methods exist, some supervised (i.e. taking into account the value of your target function) and some not. This paper outlines various techniques in fairly general terms. For more specifics there are plenty of discretisation algorithms in machine learning libraries like Weka.
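As a hedged illustration of the bucketing idea (equal-width binning is only one simple unsupervised choice; the bin count, data, and use of numpy here are my own assumptions, not part of the answer):

```python
import numpy as np
from collections import Counter
from math import log2

def entropy_of_bins(values, n_bins=4):
    """Discretise a continuous target (e.g. price) into equal-width buckets,
    then compute ordinary discrete entropy over bucket membership."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # np.digitize assigns each value a bucket index; clip keeps the maximum in the last bucket
    bucket_ids = np.clip(np.digitize(values, edges), 1, n_bins)
    total = len(values)
    return -sum(c / total * log2(c / total) for c in Counter(bucket_ids).values())

prices = [12.0, 15.5, 14.0, 40.0, 42.5, 80.0, 81.0, 85.0]
print(entropy_of_bins(prices, n_bins=3))  # entropy of the bucketed prices, in bits
```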

The entropy of a continuous distribution is called differential entropy. It can also be estimated by assuming your data follows some distribution (normally distributed, for example), estimating the parameters of that underlying distribution in the usual way, and using them to calculate an entropy value.
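For the normal-distribution case specifically, differential entropy has a closed form, h(X) = 0.5 * ln(2 * pi * e * sigma^2) nats, so a sketch of the suggested estimate might look like this (the function name and data are illustrative assumptions):

```python
import numpy as np

def gaussian_differential_entropy(values):
    """Differential entropy (in nats) of a normal distribution fitted to `values`:
    h(X) = 0.5 * ln(2 * pi * e * sigma^2)."""
    sigma = np.std(values)  # maximum-likelihood estimate of the standard deviation
    return 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

prices = np.array([12.0, 15.5, 14.0, 40.0, 42.5, 80.0, 81.0, 85.0])
print(gaussian_differential_entropy(prices))               # nats
print(gaussian_differential_entropy(prices) / np.log(2))   # converted to bits
```

Note that, unlike discrete entropy, differential entropy can be negative, and its value depends on the units in which the variable is measured.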

answered Nov 10 '22 by Vic Smith