I am using the SciPy implementation of KL-divergence (http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html) for two different numpy arrays.
The first one, let's say "base_freq", has a standard length of 2000. The second one, "test_freq", can have a different length depending on the sample; say its length is 8000.
How can I compute the KL divergence when these two do not have the same length?
My thought was to break down the second array ("test_freq") into a number of arrays of length 2000. But how is this done? And what happens when "test_freq" is a sample of length 250?
I should preface by saying that I'm no information theory expert. For the one application in which I used KL-divergence, I was comparing two images pixel-wise to compute the number of bits lost. If the images had different sizes, your proposed approach would require that for each pixel in the smaller image I choose the corresponding pixel in the larger--not any old pixel. My understanding was that KL-divergence only makes sense if you're comparing two signals sampled the same way (i.e. the same temporal or spatial sampling interval).
If you want to do what you propose, you may use numpy.random.choice:
import numpy as np

def uneven_kl_divergence(pk, qk):
    # Randomly subsample the longer array so both have the same length,
    # then compute sum(pk * log(pk / qk)), i.e. KL(pk || qk) in nats.
    if len(pk) > len(qk):
        pk = np.random.choice(pk, len(qk))
    elif len(qk) > len(pk):
        qk = np.random.choice(qk, len(pk))
    return np.sum(pk * np.log(pk / qk))
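For example, a rough usage sketch with made-up arrays (both inputs are assumed to be non-negative and normalized to sum to 1; because of the random subsampling, the result is only approximate and will vary from call to call):

import numpy as np

# Made-up frequency arrays of different lengths, normalized to sum to 1.
base_freq = np.random.rand(2000)
base_freq = base_freq / base_freq.sum()
test_freq = np.random.rand(8000)
test_freq = test_freq / test_freq.sum()

print(uneven_kl_divergence(base_freq, test_freq))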
Disclaimer: I am not a Statistics expert.
KL-divergence is a measure between probability distributions. That means you have to make sure the inputs to your entropy function are two valid probability distributions from the same sample space.
In your case, you have a finite number of possible values, so you have a discrete random variable. This also means that each outcome of your variable can be measured as a frequency of occurrences over a number of trials.
Let me give you a simple example. Let's say your random variable represents a biased die with 6 possible outcomes (6 sides). You throw the die 100 times.
Imagine that you got the following counts:
1: 10 times
2: 12 times
3: 08 times
4: 30 times
5: 20 times
6: 20 times
Since each outcome (side) happened a number of times, you just have to divide each outcome's count by 100. This is its frequency, which is also your probability.
So we now have:
P(side=1) = 10/100 = .10
P(side=2) = 12/100 = .12
P(side=3) = 08/100 = .08
P(side=4) = 30/100 = .30
P(side=5) = 20/100 = .20
P(side=6) = 20/100 = .20
And finally, here is your probability distribution:
[.10, .12, .08, .30, .20, .20]
Notice it sums up to 1, as expected of a probability distribution.
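In code, this normalization is just a division by the total number of throws (using the counts above):

import numpy as np

counts = np.array([10, 12, 8, 30, 20, 20])  # the 100 throws above
probs = counts / counts.sum()               # array([0.1, 0.12, 0.08, 0.3, 0.2, 0.2])
print(probs.sum())                          # 1.0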
If you do a second experiment and come up with a different probability distribution, it will still have 6 probabilities, even if your number of trials this time is not 100.
This is all to say that it does not make sense to compare two probability distributions from different sample spaces. If you have a way of converting from one sample space to another, it would be possible. However, be sure your probability distributions are representations of the same sample space. It does not make sense to compare the probabilities of a 6-sided die and an 8-sided die, because they don't represent the same thing.
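If base_freq and test_freq are raw samples rather than distributions, one way to put them on the same sample space is to histogram both over a shared set of bins and let scipy.stats.entropy normalize the counts. A sketch (the sample data and bin edges here are arbitrary assumptions):

import numpy as np
from scipy.stats import entropy

# Hypothetical raw samples of different lengths.
base_samples = np.random.normal(0.0, 1.0, 2000)
test_samples = np.random.normal(0.5, 1.0, 8000)

# Histogram both samples over the SAME bins so they share a sample space.
bins = np.linspace(-5, 5, 51)
base_counts, _ = np.histogram(base_samples, bins=bins)
test_counts, _ = np.histogram(test_samples, bins=bins)

# entropy(pk, qk) normalizes its inputs and returns KL(pk || qk);
# add a tiny constant so empty bins do not produce infinities.
print(entropy(base_counts + 1e-9, test_counts + 1e-9))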