In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:
import numpy as np

def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)

    if n_classes <= 1:
        return 0

    return - np.sum(probs * np.log(probs)) / np.log(n_classes)
Is there a faster way?
You can use scipy.stats.entropy, which calculates the entropy of a distribution for given probability values. If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=axis). If qk is not None, it instead computes the Kullback-Leibler divergence S = sum(pk * log(pk / qk), axis=axis).
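As a quick sketch of how that applies here (the labels list below is just an illustrative 0-1 vector, not from the question):

import numpy as np
from scipy.stats import entropy

labels = [0, 1, 1, 0, 1, 1, 1, 0]                  # illustrative 0-1 vector
_, counts = np.unique(labels, return_counts=True)  # occurrences of each class
print(entropy(counts, base=2))                     # scipy normalizes the counts to probabilities

Passing the raw counts is enough, since scipy divides by their sum before taking logs.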
EntroPy is a Python 3 package providing several time-efficient algorithms for computing the complexity of time-series. It can be used for example to extract features from EEG signals.
The conditional entropy also needs the two arrays to be of equal length. In fact, you can calculate it from the joint entropy and the individual entropies: H(X|Y) = H(X,Y) - H(Y). Perhaps if you give more details, it will be easier to help.
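A minimal sketch of that identity in numpy (the helper names entropy_from_counts and conditional_entropy are hypothetical, not from any library):

import numpy as np

def entropy_from_counts(counts):
    """Shannon entropy (in nats) from an array of counts."""
    probs = counts[counts > 0] / counts.sum()
    return -np.sum(probs * np.log(probs))

def conditional_entropy(x, y):
    """H(X|Y) = H(X,Y) - H(Y) for two equal-length label arrays."""
    x_codes = np.unique(x, return_inverse=True)[1]
    y_codes = np.unique(y, return_inverse=True)[1]
    n_y = y_codes.max() + 1
    # Encode each (x, y) pair as a single integer to count joint occurrences
    joint_counts = np.bincount(x_codes * n_y + y_codes)
    y_counts = np.bincount(y_codes)
    return entropy_from_counts(joint_counts) - entropy_from_counts(y_counts)

Calling conditional_entropy(x, y) on two equal-length label arrays returns H(X|Y) in nats.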
@Sanjeet Gupta's answer is good but could be condensed. This question specifically asks about the "fastest" way, but I only see times in one answer, so I'll post a comparison of using scipy and numpy against the original poster's entropy2 answer, with slight alterations.
Four different approaches: (1) scipy/numpy, (2) numpy/math, (3) pandas/numpy, (4) numpy
import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd
import timeit

def entropy1(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

def entropy2(labels, base=None):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    value, counts = np.unique(labels, return_counts=True)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute entropy
    base = e if base is None else base
    for i in probs:
        ent -= i * log(i, base)

    return ent

def entropy3(labels, base=None):
    vc = pd.Series(labels).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc) / np.log(base)).sum()

def entropy4(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    norm_counts = counts / counts.sum()
    base = e if base is None else base
    return -(norm_counts * np.log(norm_counts) / np.log(base)).sum()
Timeit operations:
repeat_number = 1000000

a = timeit.repeat(stmt='''entropy1(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
                  repeat=3, number=repeat_number)

b = timeit.repeat(stmt='''entropy2(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
                  repeat=3, number=repeat_number)

c = timeit.repeat(stmt='''entropy3(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
                  repeat=3, number=repeat_number)

d = timeit.repeat(stmt='''entropy4(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
                  repeat=3, number=repeat_number)
Timeit results:
# for loop to print out results of timeit
for approach, timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'],
                                    [a, b, c, d]):
    print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))

Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938
Winner: numpy/math (entropy2)
It's also worth noting that the entropy2 function above can handle both numeric and text data, e.g. entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use case for binning ints, but it won't work for text.