Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Incremental entropy computation

Let std::vector<int> counts be a vector of positive integers and let N:=counts[0]+...+counts[counts.length()-1] be the the sum of vector components. Setting pi:=counts[i]/N, I compute the entropy using the classic formula H=p0*log2(p0)+...+pn*log2(pn).

The counts vector is changing --- counts are incremented --- and every 200 changes I recompute the entropy. After a quick google and stackoverflow search I couldn't find any method for incremental entropy computation. So the question: Is there an incremental method, like the ones for variance, for entropy computation?

EDIT: Motivation for this question was usage of such formulas for incremental information gain estimation in VFDT-like learners.

Resolved: See this mathoverflow post.

like image 273
blazs Avatar asked Feb 04 '26 11:02

blazs


1 Answers

I derived update formulas and algorithms for entropy and Gini index and made the note available on arXiv. (The working version of the note is available here.) Also see this mathoverflow answer.

For the sake of convenience I am including simple Python code, demonstrating the derived formulas:

from math import log
from random import randint
      
# maps x to -x*log2(x) for x>0, and to 0 otherwise 
h = lambda p: -p*log(p, 2) if p > 0 else 0

# update entropy if new example x comes in 
def update(H, S, x):
  new_S = S+x
  return 1.0*H*S/new_S+h(1.0*x/new_S)+h(1.0*S/new_S)

# entropy of union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
  S = S1+S2
  return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)

# compute entropy(L) using only `update' function 
def test(L):
  S = 0.0 # sum of the sample elements
  H = 0.0 # sample entropy 
  for x in L:
    H = update(H, S, x)
    S = S+x
  return H

# compute entropy using the classic equation 
def entropy(L):
  n = 1.0*sum(L)
  return sum([h(x/n) for x in L])

# entry point 
if __name__ == "__main__":
  L = [randint(1,100) for k in range(100)]
  M = [randint(100,1000) for k in range(100)]
  
  L_ent = entropy(L)
  L_sum = sum(L)
  
  M_ent = entropy(M)
  M_sum = sum(M)
  
  T = L+M
  
  print("Full = ", entropy(T))
  print("Update = ", update(L_ent, L_sum, M_ent, M_sum))

like image 128
blazs Avatar answered Feb 07 '26 03:02

blazs



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!