I have two GMMs that I used to fit two different sets of data in the same space, and I would like to calculate the KL-divergence between them.
Currently I am using the GMMs defined in sklearn (http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GMM.html) and the SciPy implementation of KL-divergence (http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html)
How would I go about doing this? Do I want to just create tons of random points, get their probabilities on each of the two models (call them P and Q) and then use those probabilities as my input? Or is there some more canonical way to do this within the SciPy/SKLearn environment?
The Kullback-Leibler Divergence score, or KL divergence score, quantifies how much one probability distribution differs from another probability distribution. The KL divergence between two distributions Q and P is often stated using the following notation: KL(P || Q)
If two distributions perfectly match, D_{KL} (p||q) = 0 otherwise it can take values between 0 and ∞. Lower the KL divergence value, the better we have matched the true distribution with our approximation.
The Kullback-Leibler divergence is not symmetric, i.e., KL(p||q)≠KL(q||p) and it can be shown that it is a nonnegative quantity (the proof is similar to the proof that the mutual information is nonnegative; see Problem 12.16 of Chapter 12). Moreover, it is zero if and only if p = q.
Although the KL divergence measures the “distance” between two distri- butions, it is not a distance measure. This is because that the KL divergence is not a metric measure. It is not symmetric: the KL from p(x) to q(x) is generally not the same as the KL from q(x) to p(x).
There's no closed form for the KL divergence between GMMs. You can easily do Monte Carlo, though. Recall that KL(p||q) = \int p(x) log(p(x) / q(x)) dx = E_p[ log(p(x) / q(x))
. So:
def gmm_kl(gmm_p, gmm_q, n_samples=10**5):
X = gmm_p.sample(n_samples)
log_p_X, _ = gmm_p.score_samples(X)
log_q_X, _ = gmm_q.score_samples(X)
return log_p_X.mean() - log_q_X.mean()
(mean(log(p(x) / q(x))) = mean(log(p(x)) - log(q(x))) = mean(log(p(x))) - mean(log(q(x)))
is somewhat cheaper computationally.)
You don't want to use scipy.stats.entropy
; that's for discrete distributions.
If you want the symmetrized and smoothed Jensen-Shannon divergence KL(p||(p+q)/2) + KL(q||(p+q)/2)
instead, it's pretty similar:
def gmm_js(gmm_p, gmm_q, n_samples=10**5):
X = gmm_p.sample(n_samples)
log_p_X, _ = gmm_p.score_samples(X)
log_q_X, _ = gmm_q.score_samples(X)
log_mix_X = np.logaddexp(log_p_X, log_q_X)
Y = gmm_q.sample(n_samples)
log_p_Y, _ = gmm_p.score_samples(Y)
log_q_Y, _ = gmm_q.score_samples(Y)
log_mix_Y = np.logaddexp(log_p_Y, log_q_Y)
return (log_p_X.mean() - (log_mix_X.mean() - np.log(2))
+ log_q_Y.mean() - (log_mix_Y.mean() - np.log(2))) / 2
(log_mix_X
/log_mix_Y
are actually the log of twice the mixture densities; pulling that out of the mean operation saves some flops.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With