
Understanding Shannon entropy of a data set

I'm reading Machine Learning in Action and am going through the decision tree chapter. I understand that decision trees are built by splitting the data set in a way that structures your branches and leaves. This puts the most informative decisions near the top of the tree and limits how many decisions you need to go through.

The book shows a function determining the shannon entropy of a data set:

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the unique class labels and their occurrences
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt

Here the input data set is an array of arrays, where each inner array represents a set of potentially classifiable features:

dataSet = [[1, 1, 'yes'],
    [1, 1, 'yes'],
    [1, 0, 'no'],
    [0, 1, 'no'],
    [0, 1, 'no']]
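
For what it's worth, calling the function on this data set gives roughly 0.971, which (as far as I can tell) is just the entropy of the yes/no labels in the last column:

print(calcShannonEnt(dataSet))
# 2 "yes" and 3 "no" labels:
# -(2/5) * log(2/5, 2) - (3/5) * log(3/5, 2) ≈ 0.9710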

What I don't get is why the Shannon entropy function in this book only ever looks at the last element in each feature array. It looks like it's only calculating the entropy for the "yes" or "no" items, and not the entropy of any of the other features.


It doesn't make sense to me because the entropy for this data set

dataSet = [[1, 1, 'yes'],
    [1, 'asdfasdf', 'yes'],
    [1900, 0, 'no'],
    [0, 1, 'no'],
    ['ddd', 1, 'no']]

is the same as the entropy above, even though it contains far more diverse data.

Shouldn't the other feature elements be counted as well in order to give the total entropy of the data set, or am I misunderstanding what the entropy calculation is supposed to do?

If anyone is curious, the full source (which is where this code came from) for the book is here under the Chapter03 folder.

asked May 19 '13 by devshorts


1 Answer

The potential ambiguity here is that the data set you are looking at contains both the features and the outcome variable, with the outcome variable in the last column. The problem you are trying to solve is: "Do Feature 1 and Feature 2 help me predict the Outcome?"

Another way to state this is, if I split my data according to feature 1, do I get better information on the Outcome?

In this case, without splitting, the Outcome variable is [ yes, yes, no, no, no ]. If I split on Feature 1, I get 2 groups:

Feature 1 = 0 -> Outcome is [ no, no ]
Feature 1 = 1 -> Outcome is [ yes, yes, no ]

The idea here is to see if you are better off with that split. Initially, you had a certain amount of information, described by the Shannon entropy of [ yes, yes, no, no, no ]. After the split, you have two groups, with "better information" for the group where Feature 1 = 0: you know in that case that the Outcome is no, and that is measured by the entropy of [ no, no ].
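
To put numbers on that, here is a quick check reusing calcShannonEnt from the question; the 0.971, 0.918, 0.551 and 0.420 figures below are just the arithmetic worked out, not anything quoted from the book:

# Entropy before the split: labels are [yes, yes, no, no, no]  ->  ≈ 0.971
before = calcShannonEnt(dataSet)

group0 = [[0, 1, 'no'], [0, 1, 'no']]                   # Feature 1 == 0
group1 = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no']]   # Feature 1 == 1

ent0 = calcShannonEnt(group0)   # 0.0     -> pure group, Outcome is always 'no'
ent1 = calcShannonEnt(group1)   # ≈ 0.918 -> still mixed

# Weighted entropy after the split: (2/5)*0.0 + (3/5)*0.918 ≈ 0.551
# Information gain for Feature 1:   0.971 - 0.551 ≈ 0.420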

In other words, the approach is to figure out whether, out of the features you have available, there is one which, if used, increases your information about what you care about, that is, the Outcome variable. The tree building will greedily pick the feature with the highest information gain at each step, and then see if it is worth splitting the resulting groups even further.
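
A rough sketch of that greedy step, reusing calcShannonEnt and dataSet from the question (informationGain here is a hypothetical helper, not the book's actual code), could look like this:

def informationGain(dataSet, featureIndex):
    # entropy of the whole set minus the weighted entropy of the groups
    # obtained by splitting on the given feature column
    baseEntropy = calcShannonEnt(dataSet)
    newEntropy = 0.0
    for value in set(row[featureIndex] for row in dataSet):
        subset = [row for row in dataSet if row[featureIndex] == value]
        prob = float(len(subset)) / len(dataSet)
        newEntropy += prob * calcShannonEnt(subset)
    return baseEntropy - newEntropy

# The Outcome is the last column, so only the first len(row) - 1 columns
# are candidate features; pick the one with the highest information gain.
gains = [informationGain(dataSet, i) for i in range(len(dataSet[0]) - 1)]
bestFeature = gains.index(max(gains))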

answered Oct 17 '22 by Mathias