I'm using sklearn.mixture.GMM in Python, and the results seem to depend on data scaling. In the following code example, I change the overall scaling but I do NOT change the relative scaling of the dimensions. Yet under the three different scaling settings I get completely different results:
from sklearn.mixture import GMM
from numpy import array, shape
from numpy.random import randn
from random import choice
# centroids will be normally-distributed around zero:
truelumps = randn(20, 5) * 10
# data randomly sampled from the centroids:
data = array([choice(truelumps) + randn(5) for _ in xrange(1000)])
# fit a 10-component GMM at each overall scale and report the total log-likelihood:
for scaler in [0.01, 1, 100]:
    scdata = data * scaler
    thegmm = GMM(n_components=10)
    thegmm.fit(scdata, n_iter=1000)
    ll = thegmm.score(scdata)
    print sum(ll)
Here's the output I get:
GMM(cvtype='diag', n_components=10)
7094.87886779
GMM(cvtype='diag', n_components=10)
-14681.566456
GMM(cvtype='diag', n_components=10)
-37576.4496656
In principle, I don't think the overall data scaling should matter, and the total log-likelihoods should come out similar each time. But maybe there's an implementation issue I'm overlooking?
I've had an answer via the scikit-learn mailing list: in my code example, the log-likelihood is indeed expected to vary with scale (because we're evaluating point likelihoods, not integrals), by an offset proportional to log(scale). So my code example in fact shows GMM giving correct results.
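To make that concrete, here's a rough sanity check (a sketch assuming the same N = 1000 points and D = 5 dimensions as in the code above): rescaling the data by a factor s divides each point's density by s**D, so the total log-likelihood over all points should drop by roughly N*D*log(s) as the scale goes up.

from numpy import log

N, D = 1000, 5   # number of samples and dimensions, as in the example above
s = 100          # ratio between consecutive scale factors (0.01 -> 1 -> 100)
# expected drop in total log-likelihood for each hundredfold increase in scale:
print N * D * log(s)   # ~23026

The gaps in the output above (about 21800 and 22900) won't match this exactly, because each run converges to a different fit, but they are in the right ballpark, which is consistent with the mailing-list explanation.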