
Fastest way to compute entropy in Python

In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:

```python
import numpy as np

def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)

    if n_classes <= 1:
        return 0
    return -np.sum(probs * np.log(probs)) / np.log(n_classes)
```

Is there a faster way?

blueSurfer Avatar asked Mar 16 '13 14:03



1 Answer

@Sanjeet Gupta's answer is good but could be condensed. This question specifically asks about the "fastest" way, but I only see times in one answer, so I'll post a comparison of scipy and numpy approaches against the original poster's function (reproduced, with slight alterations, as entropy2 below).

Four different approaches: (1) scipy/numpy, (2) numpy/math, (3) pandas/numpy, (4) numpy

```python
import timeit
from math import log, e

import numpy as np
import pandas as pd
from scipy.stats import entropy

def entropy1(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

def entropy2(labels, base=None):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    value, counts = np.unique(labels, return_counts=True)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute entropy
    base = e if base is None else base
    for i in probs:
        ent -= i * log(i, base)

    return ent

def entropy3(labels, base=None):
    vc = pd.Series(labels).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc) / np.log(base)).sum()

def entropy4(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    norm_counts = counts / counts.sum()
    base = e if base is None else base
    return -(norm_counts * np.log(norm_counts) / np.log(base)).sum()
```
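As a standalone sanity check (my own sketch, assuming only numpy and the standard library, not part of the benchmark), the scalar loop of entropy2 and the vectorized expression of entropy4 compute the same quantity:

```python
import math
import numpy as np

labels = [1, 3, 5, 2, 3, 5, 3, 2, 1, 3, 4, 5]

# entropy2-style: Python loop over the class probabilities.
_, counts = np.unique(labels, return_counts=True)
probs = counts / len(labels)
ent_loop = 0.0
for p in probs:
    ent_loop -= p * math.log(p)

# entropy4-style: fully vectorized in numpy.
ent_vec = -(probs * np.log(probs)).sum()

# Both are -sum(p * ln p), so they agree to floating-point precision.
assert abs(ent_loop - ent_vec) < 1e-12
```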

Timeit operations:

```python
repeat_number = 1000000

a = timeit.repeat(stmt='''entropy1(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
                  repeat=3, number=repeat_number)

b = timeit.repeat(stmt='''entropy2(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
                  repeat=3, number=repeat_number)

c = timeit.repeat(stmt='''entropy3(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
                  repeat=3, number=repeat_number)

d = timeit.repeat(stmt='''entropy4(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
                  repeat=3, number=repeat_number)
```

Timeit results:

```python
# for loop to print out results of timeit
for approach, timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'],
                                    [a, b, c, d]):
    print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))
```

```
Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938
```

Winner: numpy/math (entropy2)

It's also worth noting that the entropy2 function above can handle numeric and text data, e.g. entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use case of binning ints, but it won't work for text.
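For the original 0-1 use case specifically, the poster's own np.bincount approach can still be used; here is a standalone sketch (not part of the benchmarks above, and unlike the original it returns plain Shannon entropy rather than the log(n_classes)-normalized value):

```python
import numpy as np

def entropy_01(labels):
    """Shannon entropy (natural log) of a vector of 0s and 1s."""
    labels = np.asarray(labels)
    n = labels.size
    if n <= 1:
        return 0.0
    counts = np.bincount(labels)     # counts of 0s and 1s
    probs = counts[counts > 0] / n   # drop empty bins
    if probs.size <= 1:
        return 0.0                   # a single class has zero entropy
    return -np.sum(probs * np.log(probs))

# A balanced vector gives the maximum ln(2) ~= 0.6931.
print(entropy_01([0, 1, 1, 0, 1, 0, 0, 1]))
```

bincount avoids the sort that np.unique performs, which is why it was a reasonable choice for ints in the first place.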

Jarad Avatar answered Sep 28 '22 00:09
