Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Mixture model fitting (Bimodal?) in SciPy using truncated normals. Python 3

I'm trying to follow this as an example but can't seem to adapt it to work with my dataset since I need truncated normals: https://stackoverflow.com/questions/35990467/fit-two-gaussians-to-a-histogram-from-one-set-of-data-python#=

I have a dataset that is definitely a mixture of 2 truncated normals. The minimum value in the domain is 0 and the maximum is 1. I want to create an object that I can fit to optimize the parameters and get the likelihood of a sequence of numbers being drawn from that distribution. One option may be to just use the KDE model and using the pdf to get the likelihood. However, I want the exact mean and standard deviations of the 2 distributions. I guess I could, split the data in half and then model the 2 normals separately but I also want to learn how to use optimize in SciPy. I'm just starting to experiment with this type of statistical analysis so my apologies if this seems naive.

I'm not sure how to get a pdf this way that can integrate to 1 and have a domain constrained between 0 and 1.

import requests
from ast import literal_eval
from scipy import optimize, stats
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


# Actual Data
u = np.asarray(literal_eval(requests.get("https://pastebin.com/raw/hP5VJ9vr").text))
# u.size ==> 6000
u.min(), u.max()
# (1.3628525454666037e-08, 0.99973136607553781)

# Distribution
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    sns.kdeplot(u, color="black", ax=ax)
    ax.axvline(0, linestyle=":", color="red")
    ax.axvline(1, linestyle=":", color="red")
kde = stats.gaussian_kde(u)

enter image description here

# KDE Model
def truncated_gaussian_lower(x,mu,sigma,A):
    return np.clip(A*np.exp(-(x-mu)**2/2/sigma**2), a_min=0, a_max=None)
def truncated_gaussian_upper(x,mu,sigma,A):
    return np.clip(A*np.exp(-(x-mu)**2/2/sigma**2), a_min=None, a_max=1)
def mixture_model(x,mu1,sigma1,A1,mu2,sigma2,A2):
    return truncated_gaussian_lower(x,mu1,sigma1,A1) + truncated_gaussian_upper(x,mu2,sigma2,A2)
kde = stats.gaussian_kde(u)

# Estimates: mu sigma A
estimates= [0.1, 1, 3, 
            0.9, 1, 1]
params,cov= optimize.curve_fit(mixture_model,u,kde.pdf(u),estimates )

# ---------------------------------------------------------------------------
# RuntimeError                              Traceback (most recent call last)
# <ipython-input-265-b2efb2ca0e0a> in <module>()
#      32 estimates= [0.1, 1, 3, 
#      33             0.9, 1, 1]
# ---> 34 params,cov= optimize.curve_fit(mixture_model,u,kde.pdf(u),estimates )

# /Users/mu/anaconda/lib/python3.6/site-packages/scipy/optimize/minpack.py in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, jac, **kwargs)
#     738         cost = np.sum(infodict['fvec'] ** 2)
#     739         if ier not in [1, 2, 3, 4]:
# --> 740             raise RuntimeError("Optimal parameters not found: " + errmsg)
#     741     else:
#     742         # Rename maxfev (leastsq) to max_nfev (least_squares), if specified.

# RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 1400.

In response to @Uvar 's very helpful explanation below. I am trying to test the integral from 0 - 1 to see if it equals 1 but I'm getting 0.3. I think I'm missing a crucial step in logic:

# KDE Model
def truncated_gaussian(x,mu,sigma,A):
    return A*np.exp(-(x-mu)**2/2/sigma**2)

def mixture_model(x,mu1,sigma1,A1,mu2,sigma2,A2):
    if type(x) == np.ndarray:
        norm_probas = truncated_gaussian(x,mu1,sigma1,A1) + truncated_gaussian(x,mu2,sigma2,A2)
        mask_lower = x < 0
        mask_upper = x > 1
        mask_floor = (mask_lower.astype(int) + mask_upper.astype(int)) > 1
        norm_probas[mask_floor] = 0
        return norm_probas
    else:
        if (x < 0) or (x > 1):
            return 0
        return truncated_gaussian_lower(x,mu1,sigma1,A1) + truncated_gaussian_upper(x,mu2,sigma2,A2)

kde = stats.gaussian_kde(u, bw_method=2e-2)

# # Estimates: mu sigma A
estimates= [0.1, 1, 3, 
            0.9, 1, 1]
params,cov= optimize.curve_fit(mixture_model,u,kde.pdf(u)/integrate.quad(kde, 0 , 1)[0],estimates ,maxfev=5000)
# params
# array([  9.89751700e-01,   1.92831695e-02,   7.84324114e+00,
#          3.73623345e-03,   1.07754038e-02,   3.79238972e+01])

# Test the integral from 0 - 1
x = np.linspace(0,1,1000)
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    ax.plot(x, kde(x), color="black", label="Data")
    ax.plot(x, mixture_model(x, *params), color="red", label="Model")
    ax.legend()
# Integrating from 0 to 1
integrate.quad(lambda x: mixture_model(x, *params), 0,1)[0]
# 0.3026863969781809

enter image description here

like image 681
O.rka Avatar asked May 02 '26 15:05

O.rka


1 Answers

It seems you are misspecifying the fitting procedure. You are trying to fit the kde.pdf(u) while constraining half-bounds.

foo = kde.pdf(u)

min(foo)
Out[329]: 0.22903365654960098

max(foo)
Out[330]: 4.0119283429320332

As you can see, the probability density function of u is not constrained to [0,1]. As such, just deleting the clipping action, will result in an exact fit.

def truncated_gaussian_lower(x,mu,sigma,A):
    return A*np.exp((-(x-mu)**2)/(2*sigma**2))
def truncated_gaussian_upper(x,mu,sigma,A):
    return A * np.exp((-(x-mu)**2)/(2*sigma**2))
def mixture_model(x,mu1,sigma1,A1,mu2,sigma2,A2):
    return truncated_gaussian_lower(x,mu1,sigma1,A1) + truncated_gaussian_upper(x,mu2,sigma2,A2)

estimates= [0.15, 1, 3, 
            0.95, 1, 1]
params,cov= optimize.curve_fit(f=mixture_model, xdata=u, ydata=kde.pdf(u), p0=estimates)

params
Out[327]: 
array([ 0.00672248,  0.07462657,  4.01188383,  0.98006841,  0.07654998,
        1.30569665])

y3 = mixture_model(u, params[0], params[1], params[2], params[3], params[4], params[5])

plt.plot(kde.pdf(u)+0.1) #add offset for visual inspection purpose

plt.plot(y3)

Perfect overlap, made visible by the +0.1 offset

So, let's now say I change what I am plotting to:

plt.figure(); plt.plot(u,y3,'.')

Which does look like what you are trying to achieve

Because, indeed:

np.allclose(y3, kde(u), atol=1e-2)
>>True

You can edit the mixture model a bit to be 0 outside of the domain [0, 1]:

def mixture_model(x,mu1,sigma1,A1,mu2,sigma2,A2):
    if (x < 0) or (x > 1):
        return 0
    return truncated_gaussian_lower(x,mu1,sigma1,A1) + truncated_gaussian_upper(x,mu2,sigma2,A2)

Doing so, however, will lose the option of instantly evaluating the function over an array of x.. So for the sake of argument, I will leave it out for now.

Anyway, we want our integral to sum up to 1 in the domain [0, 1], and one way to do this (feel free to play around with the bandwidth estimator in stats.gaussian_kde as well..) is to divide the probability density estimate by its integral over the domain. Take care that optimize.curve_fit only takes 1400 iterations in this implementation, so the initial parameter estimates matter.

from scipy import integrate
sum_prob = integrate.quad(kde, 0 , 1)[0]
y = kde(u)/sum_prob
# Estimates: mu sigma A
estimates= [0.15, 1, 5, 
            0.95, 0.5, 3]
params,cov= optimize.curve_fit(f=mixture_model, xdata=u, ydata=y, p0=estimates)
>>array([  6.72247814e-03,   7.46265651e-02,   7.23699661e+00,
     9.80068414e-01,   7.65499825e-02,   2.35533297e+00])

y3 = mixture_model(np.arange(0,1,0.001), params[0], params[1], params[2], 
    params[3], params[4], params[5])

with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    sns.kdeplot(u, color="black", ax=ax)
    ax.axvline(0, linestyle=":", color="red")
    ax.axvline(1, linestyle=":", color="red")
    plt.plot(np.arange(0,1,0.001), y3) #The red line is now your custom pdf with area-under-curve = 0.998 in the domain..

Total plot

To check for the area under the curve, I used this hacky solution of redefining mixture_model..:

def mixture_model(x):
    mu1=params[0]; sigma1=params[1]; A1=params[2]; mu2=params[3]; sigma2=params[4]; A2=params[5]
    return truncated_gaussian_lower(x,mu1,sigma1,A1) + truncated_gaussian_upper(x,mu2,sigma2,A2)

from scipy import integrate
integrated_value, error = integrate.quad(mixture_model, 0, 1) #0 lower bound, 1 upper bound
>>(0.9978588016186962, 5.222293368393178e-14)

Or doing the integral a second way:

import sympy
x = sympy.symbols('x', real=True, nonnegative=True)
foo = sympy.integrate(params[2]*sympy.exp((-(x-params[0])**2)/(2*params[1]**2))+params[5]*sympy.exp((-(x-params[3])**2)/(2*params[4]**2)),(x,0,1), manual=True)
foo.doit()

>>0.562981541724715*sqrt(pi) #this evaluates to 0.9978588016186956

And actually doing it your way as described in your edited question:

def mixture_model(x,mu1,sigma1,A1,mu2,sigma2,A2):
    return truncated_gaussian_lower(x,mu1,sigma1,A1) + truncated_gaussian_upper(x,mu2,sigma2,A2)
integrate.quad(lambda x: mixture_model(x, *params), 0,1)[0]
>>0.9978588016186962

If I set my bandwidth to your level (2e-2), indeed the evaluation comes down to 0.92, which is a worse result than the 0.998 we had earlier, but that is still significantly different from the 0.3 you report which is something I cannot recreate, even while copying your code snippets. Do you perhaps accidentally redefine functions/variables somewhere?

like image 105
Uvar Avatar answered May 05 '26 06:05

Uvar



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!