Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Producing an MLE for a pair of distributions in python

Ok, so my current curve fitting code has a step that uses scipy.stats to determine the right distribution based on the data,

distributions = [st.laplace, st.norm, st.expon, st.dweibull, st.invweibull, st.lognorm, st.uniform]
mles = []

for distribution in distributions:
    pars = distribution.fit(data)
    mle = distribution.nnlf(pars, data)
    mles.append(mle)

results = [(distribution.name, mle) for distribution, mle in zip(distributions, mles)]

for dist in sorted(zip(distributions, mles), key=lambda d: d[1]):
    print dist
best_fit = sorted(zip(distributions, mles), key=lambda d: d[1])[0]
print 'Best fit reached using {}, MLE value: {}'.format(best_fit[0].name, best_fit[1])          


print [mod[0].name for mod in sorted(zip(distributions, mles), key=lambda d: d[1])]

Where data is a list of numeric values. This is working great so far for fitting unimodal distributions, confirmed in a script that randomly generates values from random distributions and uses curve_fit to redetermine the parameters.

A fitted normal distribution

Now I would like to make the code able to handle bimodal distributions, like the example below:

A normal and an exponential distribution combined

Is it possible to get a MLE for a pair of models from scipy.stats in order to determine if a particular pair of distributions are a good fit for the data?, something like

distributions = [st.laplace, st.norm, st.expon, st.dweibull, st.invweibull, st.lognorm, st.uniform]
distributionPairs = [[modelA.name, modelB.name] for modelA in distributions for modelB in distributions]

and use those pairs to get an MLE value of that pair of distributions fitting the data?

like image 228
BruceJohnJennerLawso Avatar asked Feb 03 '17 22:02

BruceJohnJennerLawso


1 Answers

It's not a complete answer but it may help you to solve your problem. Let say you know your problem is generated by two densities. A solution would be to use k-mean or EM algorithm.

Initalization. You initialize your algorithm by affecting every observation to one or the other density. And you initialize the two densities (you initialize the parameters of the density, and one of the parameter in your case is "gaussian", "laplace", and so on... Iteration. Then, iterately, you run the two following steps :

Step 1. Optimize the parameters assuming that the affectation of every point is right. You can now use any optimization solver. This step provide you with an estimation of the best two densities (with given parameter) that fit your data.

Step 2. You classify every observation to one density or the other according to the greatest likelihood.

You repeat until convergence.

This is very well explained in this web-page https://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html

If you do not know how many densities have generated your data, the problem is more difficult. You have to work with penalized classification problem, which is a bit harder.

Here is a coding example in an easy case : you know that your data comes from 2 different Gaussians (you don't know how many variables are generated from each density). In your case, you can adjust this code to loop on every possible pair of density (computationally longer, but would empirically work I presume)

import scipy.stats as st
import numpy as np

#hard coded data generation
data = np.random.normal(-3, 1, size = 1000)
data[600:] = np.random.normal(loc = 3, scale = 2, size=400)

#initialization

mu1 = -1
sigma1 = 1

mu2 = 1
sigma2 = 1

#criterion to stop iteration
epsilon = 0.1
stop = False

while  not stop :  
    #step1
    classification = np.zeros(len(data))
    classification[st.norm.pdf(data, mu1, sigma1) > st.norm.pdf(data, mu2, sigma2)] = 1

    mu1_old, mu2_old, sigma1_old, sigma2_old = mu1, mu2, sigma1, sigma2

    #step2
    pars1 = st.norm.fit(data[classification == 1])
    mu1, sigma1 = pars1
    pars2 = st.norm.fit(data[classification == 0])
    mu2, sigma2 = pars2

    #stopping criterion
    stop = ((mu1_old - mu1)**2 + (mu2_old - mu2)**2 +(sigma1_old - sigma1)**2 +(sigma2_old - sigma2)**2) < epsilon

#result    
print("The first density is gaussian :", mu1, sigma1)
print("The first density is gaussian :", mu2, sigma2)
print("A rate of ", np.mean(classification), "is classified in the first density")

Hope it helps.

like image 171
probaPerception Avatar answered Oct 20 '22 00:10

probaPerception