 

Calculating probability with sklearn GMM

I want to determine the probability that a data point belongs to a population of data. I read that sklearn GMM can do this. I tried the following:

import numpy as np
# GMM was renamed GaussianMixture in sklearn 0.18 and removed in 0.20
from sklearn.mixture import GaussianMixture

# 2000 two-dimensional samples, each dimension drawn from N(500, 100)
training_data = np.hstack((
    np.random.normal(500, 100, 2000).reshape(-1, 1),
    np.random.normal(500, 100, 2000).reshape(-1, 1),
))

# fit the model and get the per-sample log-likelihoods
# (score() now returns the mean; score_samples() is per-sample)
g = GaussianMixture(n_components=1)
g.fit(training_data)
scores = g.score_samples(training_data)
max_score = np.amax(scores)

# create a candidate data point and calculate its log-likelihood
# under the fitted model
candidate_data = np.array([[490, 450]])
candidate_score = g.score_samples(candidate_data)

From this point on I'm not sure what to do. I read that I have to normalize the log probability in order to get the probability of a candidate data point belonging to a population. Would that be something like this?

candidate_probability = (np.exp(candidate_score)/np.exp(max_score)) * 100

print(candidate_probability)
>>> [ 87.81751913]

The number does not seem unreasonable, but I'm really out of my comfort zone here so I thought I'd ask. Thanks!

asked Nov 10 '22 by b10hazard


1 Answer

The candidate_probability you are using would not be statistically correct. I think what you would have to do is calculate, for each individual Gaussian in the mixture, the probability that the sample point is a member of that component (from the component weights and the multivariate cumulative distribution functions (CDFs)), and then sum those probabilities. The biggest problem is that I cannot find a good Python package that calculates multivariate CDFs. Unless you are able to find one, this paper would be a good starting point: https://upload.wikimedia.org/wikipedia/commons/a/a2/Cumulative_function_n_dimensional_Gaussians_12.2013.pdf
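For what it's worth, newer versions of scipy (1.0 and later) do ship a multivariate normal CDF via scipy.stats.multivariate_normal, so the idea above can be sketched roughly as follows. This is a minimal illustration under that assumption, reusing the synthetic data from the question, not a drop-in answer:

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# same synthetic training population as in the question
training_data = np.hstack((
    np.random.normal(500, 100, 2000).reshape(-1, 1),
    np.random.normal(500, 100, 2000).reshape(-1, 1),
))

g = GaussianMixture(n_components=1)
g.fit(training_data)

candidate = np.array([490, 450])

# weight each component's CDF at the candidate point by that component's
# mixture weight, then sum over components
prob = sum(
    w * multivariate_normal(mean=m, cov=c).cdf(candidate)
    for w, m, c in zip(g.weights_, g.means_, g.covariances_)
)
print(prob)

Note that the CDF gives P(X1 <= x1, X2 <= x2), i.e. the mass in the lower-left quadrant of the candidate point, so depending on what "belongs to the population" should mean you may still need to turn this into a two-sided measure.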

answered Nov 14 '22 by user2343530