Reading a text file and calculating probability and Shannon's entropy

I have a text file (tab separated) and I need to calculate the probability and entropy for each column in the text file. Here is what my text file looks like:

aaa 0.0520852296    0.1648703511    0.1648703511
bbb 0.1062639955    0.1632039268    0.1632039268
ccc 1.4112745088    4.3654577641    4.3654577641
ddd 0.4992644913    0.1648703511    0.1648703511
eeee    0.169058175 0.1632039268    0.1632039268

and so I can calculate the probability using the following code:

import pandas as pd
f=open(mydata,'r')
df = pd.DataFrame(pd.read_csv(f, sep='\t', header=None, names=['val1', 'val2', 'val3']))
print(df)
df.loc[:,"val1":"val3"] = df.loc[:,"val1":"val3"].div(df.sum(axis=0), axis=1)
print(df)

which outputs,

aaa 0.0232736716    0.0328321936    0.0328321936
bbb 0.0474828153    0.0325003428    0.0325003428
ccc 0.6306113983    0.8693349271    0.8693349271
ddd 0.2230904597    0.0328321936    0.0328321936
eeee    0.0755416551    0.0325003428    0.0325003428

On that output I want to calculate the entropy of each column and write the results to an output file, so I have the following code:

import math
entropy = - sum([ p * math.log(p) / math.log(2.0) for p in df ])

But I get the following error message:

TypeError: a float is required

Any help is much appreciated. Thank you all

asked Jun 26 '15 14:06

user1017373

2 Answers

Your problem is with this line

entropy = - sum([ p * math.log(p) / math.log(2.0) for p in df ])

If you think about (or print out) what p for p in df is giving you (e.g. run print([p for p in df])), you can see that p contains only the headings of the columns. So you are passing a text label into the math functions that expect a float. Hence the error.

apply might work well for you here:

import math

def shannon(col):
    entropy = - sum([ p * math.log(p) / math.log(2.0) for p in col])
    return entropy

sh_df = df.loc[:,'val1':'val3'].apply(shannon,axis=0)

print(sh_df)
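For larger columns, the same calculation can be done with NumPy instead of a Python-level loop. This is only a sketch: the function name shannon_entropy and the small probs table are made up for illustration, and dropping zero probabilities (so that 0 * log2(0) counts as 0) is a common convention rather than something the question asked for.

```python
import numpy as np
import pandas as pd


def shannon_entropy(col):
    """Shannon entropy, in bits, of one column of probabilities."""
    p = col.to_numpy(dtype=float)
    p = p[p > 0]  # skip zeros so that 0 * log2(0) is treated as 0
    return float(-(p * np.log2(p)).sum())


# Hypothetical probabilities, not the asker's data; each column sums to 1.
probs = pd.DataFrame(
    {"val1": [0.5, 0.5, 0.0, 0.0], "val2": [0.25, 0.25, 0.25, 0.25]},
    index=["aaa", "bbb", "ccc", "ddd"],
)

# apply calls the function once per column and returns a Series.
entropies = probs.apply(shannon_entropy, axis=0)
print(entropies)
```

A column split evenly between two outcomes gives 1 bit, and one split evenly between four outcomes gives 2 bits, which is a quick sanity check for the function.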

Note

As others have pointed out, you might want to tidy up your DataFrame by making column 0 the index; then you won't have to use

df.loc[:,'val1':'val3']

So you could import your data using:

df = pd.read_csv(f, sep='\t', header=None, index_col=0, names=['val1', 'val2', 'val3'])

and avoid the need for the cumbersome loc[:,'val1':'val3'] syntax.
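Putting the pieces together, here is a rough end-to-end sketch: read the data with the first column as the index, normalise each column, compute the entropy per column, and write the result to a file. Several things here are my own assumptions: the sample rows are fed in through io.StringIO with explicit tabs, the first column is given the made-up name label (naming all four columns keeps the index handling explicit), and the output file name entropy.tsv is hypothetical.

```python
import io
import os
import tempfile

import numpy as np
import pandas as pd

# Sample rows from the question, written with explicit tabs.
# With a real file, pass its path to read_csv instead of the StringIO object.
raw = (
    "aaa\t0.0520852296\t0.1648703511\t0.1648703511\n"
    "bbb\t0.1062639955\t0.1632039268\t0.1632039268\n"
    "ccc\t1.4112745088\t4.3654577641\t4.3654577641\n"
    "ddd\t0.4992644913\t0.1648703511\t0.1648703511\n"
    "eeee\t0.169058175\t0.1632039268\t0.1632039268\n"
)

# "label" is a made-up name for the first column, which becomes the index.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                 names=["label", "val1", "val2", "val3"], index_col="label")

# Divide each column by its own sum so that it adds up to 1.
probs = df / df.sum(axis=0)

# Shannon entropy in bits, one value per column.
entropy = -(probs * np.log2(probs)).sum(axis=0)

# Write the result to a tab-separated file (hypothetical file name).
out_path = os.path.join(tempfile.mkdtemp(), "entropy.tsv")
entropy.to_frame("entropy").to_csv(out_path, sep="\t", index_label="column")
print(entropy)
```

All values in the sample are positive, so np.log2 is safe here. If a column can contain zeros, they need to be filtered out first, otherwise the result for that column is NaN.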

answered Sep 19 '22 16:09

J Richard Snape

Why don't you fix your data file instead of trying to do so in Python code, which reduces readability? It's as simple as

sed 's/ \+/,/g' mydata > my_fixed_data

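Just run this on the command line if you are using Linux. It replaces each run of one or more spaces with a single comma. If the columns are separated by tabs rather than spaces, the pattern has to match tabs as well.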

mydata

aaa 0.0520852296    0.1648703511    0.1648703511
bbb 0.1062639955    0.1632039268    0.1632039268
ccc 1.4112745088    4.3654577641    4.3654577641
ddd 0.4992644913    0.1648703511    0.1648703511
eeee    0.169058175 0.1632039268    0.1632039268

my_fixed_data

aaa,0.0520852296,0.1648703511,0.1648703511
bbb,0.1062639955,0.1632039268,0.1632039268
ccc,1.4112745088,4.3654577641,4.3654577641
ddd,0.4992644913,0.1648703511,0.1648703511
eeee,0.169058175,0.1632039268,0.1632039268

Then you can simply use the read_csv function like

df = pd.read_csv('my_fixed_data', header=None, index_col=0, names=['val1', 'val2', 'val3'])

Here's what the dataframe now looks like:

          val1      val2      val3
aaa   0.052085  0.164870  0.164870
bbb   0.106264  0.163204  0.163204
ccc   1.411275  4.365458  4.365458
ddd   0.499264  0.164870  0.164870
eeee  0.169058  0.163204  0.163204

I'm sure there must be equivalents for Windows too. Just google it.
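One platform-independent option is to let pandas split on whitespace itself, so no separate conversion step is needed. A minimal sketch, assuming the columns are separated by any mix of spaces and tabs; the sample rows are fed in through io.StringIO, and label is a made-up name for the first column:

```python
import io

import pandas as pd

# Sample rows as they appear in the post, separated by runs of spaces.
raw = (
    "aaa 0.0520852296    0.1648703511    0.1648703511\n"
    "bbb 0.1062639955    0.1632039268    0.1632039268\n"
    "ccc 1.4112745088    4.3654577641    4.3654577641\n"
    "ddd 0.4992644913    0.1648703511    0.1648703511\n"
    "eeee    0.169058175 0.1632039268    0.1632039268\n"
)

# sep=r"\s+" treats any run of whitespace (spaces or tabs) as one delimiter.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None,
                 names=["label", "val1", "val2", "val3"], index_col="label")
print(df)
```

With a real file, the path goes in place of the io.StringIO object and the rest stays the same.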

You get the TypeError: a float is required error because for p in df gives you the column names and not some float values. You may have to fix it accordingly.

>>> for p in df:
...     print(p)
...
val1
val2
val3
>>>
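One way to fix it is to loop over the columns with df.items(), which yields each column name together with its values, and compute the sum over the values. The small probs table below is hypothetical and stands in for the normalised DataFrame from the question:

```python
import math

import pandas as pd

# Hypothetical probabilities; each column sums to 1.
probs = pd.DataFrame({"val1": [0.5, 0.25, 0.25],
                      "val2": [1 / 3, 1 / 3, 1 / 3]})

# Iterating over the DataFrame itself yields only the column labels.
labels = [p for p in probs]

# items() yields (column name, column values) pairs instead.
entropies = {}
for name, col in probs.items():
    entropies[name] = -sum(p * math.log2(p) for p in col)

print(labels)
print(entropies)
```

This assumes every probability is strictly positive, since math.log2 raises a ValueError for 0.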

answered Sep 20 '22 16:09

lakshayg

Tags:

python

math

pandas

numpy