In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:
import numpy as np

def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)

    if n_classes <= 1:
        return 0

    return - np.sum(probs * np.log(probs)) / np.log(n_classes)
Is there a faster way?
You can use scipy.stats.entropy, which calculates the entropy of a distribution for given probability values. If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=axis). If qk is not None, it instead computes the Kullback-Leibler divergence S = sum(pk * log(pk / qk), axis=axis).
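As a quick sketch of how that applies here (the labels list below is just an illustrative 0-1 vector, not from the question):

import numpy as np
from scipy.stats import entropy

labels = [0, 1, 1, 0, 1, 1, 1, 0]                  # illustrative 0-1 vector
_, counts = np.unique(labels, return_counts=True)  # occurrences of each class
print(entropy(counts, base=2))                     # scipy normalizes the counts to probabilities

Passing the raw counts is enough, since scipy divides by their sum before taking logs.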
EntroPy is a Python 3 package providing several time-efficient algorithms for computing the complexity of time-series. It can be used for example to extract features from EEG signals.
The conditional entropy also needs the two arrays to be of equal length. In fact, you can calculate it from the joint entropy and the individual entropies: H(X|Y) = H(X,Y) - H(Y). Perhaps if you give more details, it will be easier to help.
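A minimal sketch of that identity in numpy (the helper names entropy_from_counts and conditional_entropy are hypothetical, not from any library):

import numpy as np

def entropy_from_counts(counts):
    """Shannon entropy (in nats) from an array of counts."""
    probs = counts[counts > 0] / counts.sum()
    return -np.sum(probs * np.log(probs))

def conditional_entropy(x, y):
    """H(X|Y) = H(X,Y) - H(Y) for two equal-length label arrays."""
    x_codes = np.unique(x, return_inverse=True)[1]
    y_codes = np.unique(y, return_inverse=True)[1]
    n_y = y_codes.max() + 1
    # Encode each (x, y) pair as a single integer to count joint occurrences
    joint_counts = np.bincount(x_codes * n_y + y_codes)
    y_counts = np.bincount(y_codes)
    return entropy_from_counts(joint_counts) - entropy_from_counts(y_counts)

Calling conditional_entropy(x, y) on two equal-length label arrays returns H(X|Y) in nats.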
@Sanjeet Gupta's answer is good but could be condensed. This question specifically asks about the "fastest" way, but I only see times in one answer, so I'll post a comparison of using scipy and numpy against the original poster's entropy2 answer, with slight alterations.
Four different approaches: (1) scipy/numpy, (2) numpy/math, (3) pandas/numpy, (4) numpy
import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd
import timeit

def entropy1(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

def entropy2(labels, base=None):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    value, counts = np.unique(labels, return_counts=True)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute entropy
    base = e if base is None else base
    for i in probs:
        ent -= i * log(i, base)

    return ent

def entropy3(labels, base=None):
    vc = pd.Series(labels).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc) / np.log(base)).sum()

def entropy4(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    norm_counts = counts / counts.sum()
    base = e if base is None else base
    return -(norm_counts * np.log(norm_counts) / np.log(base)).sum()
Timeit operations:
repeat_number = 1000000

a = timeit.repeat(stmt='''entropy1(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
                  repeat=3, number=repeat_number)

b = timeit.repeat(stmt='''entropy2(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
                  repeat=3, number=repeat_number)

c = timeit.repeat(stmt='''entropy3(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
                  repeat=3, number=repeat_number)

d = timeit.repeat(stmt='''entropy4(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
                  repeat=3, number=repeat_number)
Timeit results:
# for loop to print out results of timeit
for approach, timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'],
                                    [a, b, c, d]):
    print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))

Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938
Winner: numpy/math (entropy2)
It's also worth noting that the entropy2 function above can handle both numeric and text data, e.g. entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use case for binning ints, but it won't work for text.