Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can implement EM-GMM in python?

I have implemented EM algorithm for GMM using this post GMMs and Maximum Likelihood Optimization Using NumPy unsuccessfully as follows:

import numpy as np

def PDF(data, means, variances):
    return 1/(np.sqrt(2 * np.pi * variances) + eps) * np.exp(-1/2 * (np.square(data - means) / (variances + eps)))

def EM_GMM(data, k, iterations):
    weights = np.ones((k, 1)) / k # shape=(k, 1)
    means = np.random.choice(data, k)[:, np.newaxis] # shape=(k, 1)
    variances = np.random.random_sample(size=k)[:, np.newaxis] # shape=(k, 1)

    data = np.repeat(data[np.newaxis, :], k, 0) # shape=(k, n)

    for step in range(iterations):
        # Expectation step
        likelihood = PDF(data, means, np.sqrt(variances)) # shape=(k, n)

        # Maximization step
        b = likelihood * weights # shape=(k, n)
        b /= np.sum(b, axis=1)[:, np.newaxis] + eps

        # updage means, variances, and weights
        means = np.sum(b * data, axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        variances = np.sum(b * np.square(data - means), axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        weights = np.mean(b, axis=1)[:, np.newaxis]
        
    return means, variances

when I run the algorithm on a 1-D time-series dataset, for k equal to 3, it returns an output like the following:

array([[0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    3.05053810e-003, 2.36989898e-025, 2.36989898e-025,
    1.32797395e-136, 6.91134950e-031, 5.47347807e-001,
    1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
    1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
    1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
    1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
    1.44637007e+000, 2.25849208e-064, 0.00000000e+000,
    1.61228562e-303, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 3.94387272e-242,
    1.13078186e+000, 2.53108878e-001, 5.33548114e-001,
    9.14920432e-001, 2.07015697e-013, 4.45250680e-038,
    1.43000602e+000, 1.28781615e+000, 1.44821615e+000,
    1.18186109e+000, 3.21610659e-002, 3.21610659e-002,
    3.21610659e-002, 3.21610659e-002, 3.21610659e-002,
    2.47382844e-039, 0.00000000e+000, 2.09150855e-200,
    0.00000000e+000, 0.00000000e+000],
   [5.93203066e-002, 1.01647068e+000, 5.99299162e-001,
    0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 2.14690238e-010,
    2.49337135e-191, 5.10499986e-001, 9.32658804e-001,
    1.21148135e+000, 1.13315278e+000, 2.50324069e-237,
    0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 1.73966953e-125, 2.53559290e-275,
    1.42960975e-065, 7.57552338e-001],
   [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
    3.05053810e-003, 2.36989898e-025, 2.36989898e-025,
    1.32797395e-136, 6.91134950e-031, 5.47347807e-001,
    1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
    1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
    1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
    1.44637007e+000, 1.44637007e+000, 1.44637007e+000,
    1.44637007e+000, 2.25849208e-064, 0.00000000e+000,
    1.61228562e-303, 0.00000000e+000, 0.00000000e+000,
    0.00000000e+000, 0.00000000e+000, 3.94387272e-242,
    1.13078186e+000, 2.53108878e-001, 5.33548114e-001,
    9.14920432e-001, 2.07015697e-013, 4.45250680e-038,
    1.43000602e+000, 1.28781615e+000, 1.44821615e+000,
    1.18186109e+000, 3.21610659e-002, 3.21610659e-002,
    3.21610659e-002, 3.21610659e-002, 3.21610659e-002,
    2.47382844e-039, 0.00000000e+000, 2.09150855e-200,
    0.00000000e+000, 0.00000000e+000]])

which I believe is working wrong since the outputs are two vectors which one of them represents means values and the other one represents variances values. The vague point which made me doubtful about implementation is it returns back 0.00000000e+000 for most of the outputs as it can be seen and it doesn't need really to visualize these outputs. BTW the input data are time-series data. I have checked everything and traced multiple times but no bug shows up.

Here are my input data:

[25.31      , 24.31      , 24.12      , 43.46      , 41.48666667,
   41.48666667, 37.54      , 41.175     , 44.81      , 44.44571429,
   44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
   44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
   44.44571429, 44.44571429, 39.71      , 26.69      , 34.15      ,
   24.94      , 24.75      , 24.56      , 24.38      , 35.25      ,
   44.62      , 44.94      , 44.815     , 44.69      , 42.31      ,
   40.81      , 44.38      , 44.56      , 44.44      , 44.25      ,
   43.66666667, 43.66666667, 43.66666667, 43.66666667, 43.66666667,
   40.75      , 32.31      , 36.08      , 30.135     , 24.19      ]

I was wondering if there is an elegant way to implement it via numpy or SciKit-learn. Any helps will be appreciated.

Update Following is current output and expected output: img

like image 392
Mario Avatar asked Jun 13 '26 20:06

Mario


1 Answers

As I mentioned in the comment, the critical point that I see is the means initialization. Following the default implementation of sklearn Gaussian Mixture, instead of random initialization, I switched to KMeans.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('seaborn')

eps=1e-8 

def PDF(data, means, variances):
    return 1/(np.sqrt(2 * np.pi * variances) + eps) * np.exp(-1/2 * (np.square(data - means) / (variances + eps)))

def EM_GMM(data, k=3, iterations=100, init_strategy='kmeans'):
    weights = np.ones((k, 1)) / k # shape=(k, 1)
    
    if init_strategy=='kmeans':
        from sklearn.cluster import KMeans
        
        km = KMeans(k).fit(data[:, None])
        means = km.cluster_centers_ # shape=(k, 1)
        
    else: # init_strategy=='random'
        means = np.random.choice(data, k)[:, np.newaxis] # shape=(k, 1)
    
    variances = np.random.random_sample(size=k)[:, np.newaxis] # shape=(k, 1)

    data = np.repeat(data[np.newaxis, :], k, 0) # shape=(k, n)

    for step in range(iterations):
        # Expectation step
        likelihood = PDF(data, means, np.sqrt(variances)) # shape=(k, n)

        # Maximization step
        b = likelihood * weights # shape=(k, n)
        b /= np.sum(b, axis=1)[:, np.newaxis] + eps

        # updage means, variances, and weights
        means = np.sum(b * data, axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        variances = np.sum(b * np.square(data - means), axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        weights = np.mean(b, axis=1)[:, np.newaxis]
        
    return means, variances

This seems to yield the desired output much more consistently:

s = np.array([25.31      , 24.31      , 24.12      , 43.46      , 41.48666667,
              41.48666667, 37.54      , 41.175     , 44.81      , 44.44571429,
              44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
              44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
              44.44571429, 44.44571429, 39.71      , 26.69      , 34.15      ,
              24.94      , 24.75      , 24.56      , 24.38      , 35.25      ,
              44.62      , 44.94      , 44.815     , 44.69      , 42.31      ,
              40.81      , 44.38      , 44.56      , 44.44      , 44.25      ,
              43.66666667, 43.66666667, 43.66666667, 43.66666667, 43.66666667,
              40.75      , 32.31      , 36.08      , 30.135     , 24.19      ])
k=3
n_iter=100

means, variances = EM_GMM(s, k, n_iter)
print(means,variances)
[[44.42596231]
 [24.509301  ]
 [35.4137508 ]] 
[[0.07568723]
 [0.10583743]
 [0.52125856]]

# Plotting the results
colors = ['green', 'red', 'blue', 'yellow']
bins = np.linspace(np.min(s)-2, np.max(s)+2, 100)

plt.figure(figsize=(10,7))
plt.xlabel('$x$')
plt.ylabel('pdf')

sns.scatterplot(s, [0.05] * len(s), color='navy', s=40, marker=2, label='Series data')

for i, (m, v) in enumerate(zip(means, variances)):
    sns.lineplot(bins, PDF(bins, m, v), color=colors[i], label=f'Cluster {i+1}')

plt.legend()
plt.plot()

EM_result

Finally we can see that the purely random initialization generates different results; let's see the resulting means:

for _ in range(5):
    print(EM_GMM(s, k, n_iter, init_strategy='random')[0], '\n')

[[44.42596231]
 [44.42596231]
 [44.42596231]]

[[44.42596231]
 [24.509301  ]
 [30.1349997 ]]

[[44.42596231]
 [35.4137508 ]
 [44.42596231]]

[[44.42596231]
 [30.1349997 ]
 [44.42596231]]

[[44.42596231]
 [44.42596231]
 [44.42596231]]

One can see how different these results are, in some cases the resulting means is constant, meaning that inizalization chose 3 similar values and didn't change much while iterating. Adding some print statements inside the EM_GMM will clarify that.

like image 53
FBruzzesi Avatar answered Jun 15 '26 11:06

FBruzzesi



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!