
Reading a text file and calculating probability and Shannon's entropy

I have a text file (tab separated) and I need to calculate the probability and entropy for each column in the text file. Here is what my text file looks like:

aaa 0.0520852296    0.1648703511    0.1648703511
bbb 0.1062639955    0.1632039268    0.1632039268
ccc 1.4112745088    4.3654577641    4.3654577641
ddd 0.4992644913    0.1648703511    0.1648703511
eeee    0.169058175 0.1632039268    0.1632039268

I can calculate the probabilities using the following code:

import pandas as pd
f=open(mydata,'r')
df = pd.DataFrame(pd.read_csv(f, sep='\t', header=None, names=['val1', 'val2', 'val3']))
print(df)
df.loc[:,"val1":"val3"] = df.loc[:,"val1":"val3"].div(df.sum(axis=0), axis=1)
print(df)

which outputs:

aaa 0.0232736716    0.0328321936    0.0328321936
bbb 0.0474828153    0.0325003428    0.0325003428
ccc 0.6306113983    0.8693349271    0.8693349271
ddd 0.2230904597    0.0328321936    0.0328321936
eeee    0.0755416551    0.0325003428    0.0325003428

On that output I want to calculate the entropy of each column and write the results to an output file, so I have the following code:

import math
entropy = - sum([ p * math.log(p) / math.log(2.0) for p in df ])

But I get the following error message:

TypeError: a float is required

Any help is much appreciated. Thank you all

user1017373 asked Jun 26 '15 14:06



2 Answers

Your problem is with this line

entropy = - sum([ p * math.log(p) / math.log(2.0) for p in df ])

Iterating over a DataFrame yields its column labels, not its values. If you print out what the comprehension sees (e.g. run print([p for p in df])), you will find that p only ever holds a column heading. So you are passing a text label to math.log, which expects a float. Hence the error.
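To make that concrete, here is a minimal sketch using a made-up two-row frame (the values are illustrative, not the asker's data). It contrasts iterating over the frame itself with iterating over a single column:

```python
import pandas as pd

# Hypothetical miniature frame; the numbers are illustrative only.
df = pd.DataFrame(
    {"val1": [0.25, 0.75], "val2": [0.5, 0.5]},
    index=["aaa", "bbb"],
)

# Iterating over a DataFrame yields its column labels (strings).
labels = [p for p in df]
print(labels)  # ['val1', 'val2']

# Iterating over one column (a Series) yields the stored numbers.
values = [p for p in df["val1"]]
print(values)
```

Passing any of the strings in labels to math.log fails, whereas the entries of values are floats and work as expected.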

apply might work well for you here:

import math

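def shannon(col):
    # col holds the probabilities of one column; the result is the entropy in bits.
    # Assumes every p > 0, because math.log(0) raises a ValueError.
    entropy = - sum([ p * math.log(p) / math.log(2.0) for p in col])
    return entropy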

sh_df = df.loc[:,'val1':'val3'].apply(shannon,axis=0)

print(sh_df)
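If some probabilities can be exactly zero, math.log will raise. A possible variant, sketched here with NumPy, drops zero entries first (following the usual convention that 0 * log(0) contributes nothing). The name shannon_bits and the sample probabilities are made up for illustration and chosen so the results are easy to check by hand:

```python
import numpy as np
import pandas as pd

def shannon_bits(col):
    """Shannon entropy in bits of a column of probabilities that sum to 1."""
    p = np.asarray(col, dtype=float)
    p = p[p > 0]  # convention: zero-probability entries contribute nothing
    # H = sum(p * log2(1 / p)), equivalent to -sum(p * log2(p))
    return float((p * np.log2(1.0 / p)).sum())

# Hypothetical probabilities, not the asker's data.
probs = pd.DataFrame({
    "uniform": [0.25, 0.25, 0.25, 0.25],  # four equally likely outcomes: 2 bits
    "certain": [1.0, 0.0, 0.0, 0.0],      # one sure outcome: 0 bits
})

result = probs.apply(shannon_bits, axis=0)
print(result)
```

The uniform column comes out at 2 bits and the certain column at 0 bits, which is a quick sanity check that the formula is wired up correctly.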

Note

As others have pointed out, you might want to tidy up your dataframe by making column 0 the index, so that the frame holds only the numeric columns. You could import your data using:

df = pd.read_csv(f, sep='\t', header=None, index_col=0, names=['val1', 'val2', 'val3'])

and then apply shannon to df directly, avoiding the cumbersome loc[:,'val1':'val3'] syntax.
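Putting the pieces together, a minimal end-to-end sketch might look like the following. It inlines the question's sample rows so that it runs without a file, names all four columns explicitly (the name label is my own choice), and writes the result to a placeholder path in the temp directory. Treat it as one way to do it under those assumptions, not the only one:

```python
import io
import os
import tempfile

import numpy as np
import pandas as pd

# The question's sample rows, inlined as tab-separated text.
raw = (
    "aaa\t0.0520852296\t0.1648703511\t0.1648703511\n"
    "bbb\t0.1062639955\t0.1632039268\t0.1632039268\n"
    "ccc\t1.4112745088\t4.3654577641\t4.3654577641\n"
    "ddd\t0.4992644913\t0.1648703511\t0.1648703511\n"
    "eeee\t0.169058175\t0.1632039268\t0.1632039268\n"
)

# "label" is a made-up name for the first column, which becomes the index.
df = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None,
    names=["label", "val1", "val2", "val3"], index_col="label",
)

# Divide each column by its own sum so that every column adds up to 1.
probs = df / df.sum(axis=0)

# Entropy in bits per column (all probabilities here are strictly positive).
entropy = -(probs * np.log2(probs)).sum(axis=0)

# Placeholder output location; one "column<TAB>entropy" line per column.
out_path = os.path.join(tempfile.gettempdir(), "entropy.tsv")
entropy.to_csv(out_path, sep="\t", header=False)
print(entropy)
```

Because val2 and val3 are identical in the sample data, their entropies come out equal.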

J Richard Snape answered Sep 19 '22 16:09


Why don't you fix your data file instead of working around it in Python code, which reduces readability? It's as simple as:

sed 's/ \+/,/g' mydata > my_fixed_data

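Just run this on the command line if you are using Linux. It replaces each run of spaces with a single comma. If the file really is tab-separated, as the question states, the pattern has to match tabs as well, for example by using [[:space:]]\+ in place of the leading space and \+.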

mydata

aaa 0.0520852296    0.1648703511    0.1648703511
bbb 0.1062639955    0.1632039268    0.1632039268
ccc 1.4112745088    4.3654577641    4.3654577641
ddd 0.4992644913    0.1648703511    0.1648703511
eeee    0.169058175 0.1632039268    0.1632039268

my_fixed_data

aaa,0.0520852296,0.1648703511,0.1648703511
bbb,0.1062639955,0.1632039268,0.1632039268
ccc,1.4112745088,4.3654577641,4.3654577641
ddd,0.4992644913,0.1648703511,0.1648703511
eeee,0.169058175,0.1632039268,0.1632039268

Then you can simply use the read_csv function like this:

df = pd.read_csv('my_fixed_data', header=None, index_col=0, names=['val1', 'val2', 'val3'])

Here's what the dataframe now looks like:

          val1      val2      val3
aaa   0.052085  0.164870  0.164870
bbb   0.106264  0.163204  0.163204
ccc   1.411275  4.365458  4.365458
ddd   0.499264  0.164870  0.164870
eeee  0.169058  0.163204  0.163204

I'm sure there must be equivalents for Windows too. Just google it.
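One portable option that does not depend on sed is to do the same substitution in Python. The sketch below is an illustration only; the helper name whitespace_to_commas is made up, and it treats both spaces and tabs as separators:

```python
import re

def whitespace_to_commas(text):
    """Replace each run of spaces or tabs inside a line with a single comma."""
    lines = text.splitlines()
    # strip() drops leading/trailing whitespace so no stray commas appear.
    return "\n".join(re.sub(r"[ \t]+", ",", line.strip()) for line in lines)

# Two rows from the question, one tab-separated and one space-separated.
sample = (
    "aaa\t0.0520852296\t0.1648703511\t0.1648703511\n"
    "eeee    0.169058175 0.1632039268    0.1632039268"
)

converted = whitespace_to_commas(sample)
print(converted)
```

To convert a whole file, read mydata into a string, pass it through the helper, and write the result to my_fixed_data.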

You get the TypeError: a float is required error because for p in df iterates over the column names, not over the float values in the columns. You need to iterate over the values of each column instead.

>>> for p in df:
...     print(p)
...
val1
val2
val3
>>>
lakshayg answered Sep 20 '22 16:09