Fit mixture of Gaussians with fixed covariance in Python

Tags:

I have some 2D data (GPS data) with clusters (stop locations) that I know resemble Gaussians with a characteristic standard deviation (proportional to the inherent noise of GPS samples). The figure below visualizes a sample that I expect has two such clusters. The image is 25 meters wide and 13 meters tall.

enter image description here

The sklearn module has a function sklearn.mixture.GaussianMixture which allows you to fit a mixture of Gaussians to data. The function has a parameter, covariance_type, that enables you to assume different things about the shape of the Gaussians. You can, for example, assume them to be uniform using the 'tied' argument.

However, it does not appear directly possible to assume the covariance matrices to remain constant. From the sklearn source code it seems trivial to make a modification that enables this but it feels a bit excessive to make a pull request with an update that allows this (also I don't want to accidentally add bugs in sklearn). Is there a better way to fit a mixture to data where the covariance matrix of each Gaussian is fixed?

I want to assume that the SD should remain constant at around 3 meters for each component, since that is roughly the noise level of my GPS samples.

207

asked Jan 29 '18 13:01

Ulf Aslak

1 Answers

It is simple enough to write your own implementation of EM algorithm. It would also give you a good intuition of the process. I assume that covariance is known and that prior probabilities of components are equal, and fit only means.

The class would look like this (in Python 3):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

class FixedCovMixture:
    """ The model to estimate gaussian mixture with fixed covariance matrix. """
    def __init__(self, n_components, cov, max_iter=100, random_state=None, tol=1e-10):
        self.n_components = n_components
        self.cov = cov
        self.random_state = random_state
        self.max_iter = max_iter
        self.tol=tol

    def fit(self, X):
        # initialize the process:
        np.random.seed(self.random_state)
        n_obs, n_features = X.shape
        self.mean_ = X[np.random.choice(n_obs, size=self.n_components)]
        # make EM loop until convergence
        i = 0
        for i in range(self.max_iter):
            new_centers = self.updated_centers(X)
            if np.sum(np.abs(new_centers-self.mean_)) < self.tol:
                break
            else:
                self.mean_ = new_centers
        self.n_iter_ = i

    def updated_centers(self, X):
        """ A single iteration """
        # E-step: estimate probability of each cluster given cluster centers
        cluster_posterior = self.predict_proba(X)
        # M-step: update cluster centers as weighted average of observations
        weights = (cluster_posterior.T / cluster_posterior.sum(axis=1)).T
        new_centers = np.dot(weights, X)
        return new_centers


    def predict_proba(self, X):
        likelihood = np.stack([multivariate_normal.pdf(X, mean=center, cov=self.cov) 
                               for center in self.mean_])
        cluster_posterior = (likelihood / likelihood.sum(axis=0))
        return cluster_posterior

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=0)

On the data like yours, the model would converge quickly:

np.random.seed(1)
X = np.random.normal(size=(100,2), scale=3)
X[50:] += (10, 5)

model = FixedCovMixture(2, cov=[[3,0],[0,3]], random_state=1)
model.fit(X)
print(model.n_iter_, 'iterations')
print(model.mean_)

plt.scatter(X[:,0], X[:,1], s=10, c=model.predict(X))
plt.scatter(model.mean_[:,0], model.mean_[:,1], s=100, c='k')
plt.axis('equal')
plt.show();

and output

11 iterations
[[9.92301067 4.62282807]
 [0.09413883 0.03527411]]

You can see that the estimated centers ((9.9, 4.6) and (0.09, 0.03)) are close to the true centers ((10, 5) and (0, 0)).

enter image description here

161

answered Oct 06 '22 14:10

David Dale

Related questions
                            
                                Multiply scipy.lti transfer functions
                            
                                Fix Conflicting migrations detected in Django1.9
                            
                                Repeating values in a "group by" pandas dataframe
                            
                                py.test session level fixtures in setup_method
                            
                                TypeError: decoding str is not supported
                            
                                How to override Gunicorn's logging config to use a custom formatter
                            
                                import matplotlib failing with No module named _tkinter on heroku
                            
                                How to split a numpy array in fixed size chunks with and without overlap?
                            
                                Python: Access embedded OLE from Office/Excel document without clipboard
                            
                                About tensorflow Metadata and RunOptions
                            
                                imp module is deprecated in favour of importlib
                            
                                TensorFlow Dataset Shuffle Each Epoch
                            
                                Parse a string of multipart data
                            
                                Why does unpacking this map object print "must be an iterable, not map"?
                            
                                How to use a button to trigger callback updates?
                            
                                How does numpy.reshape() with order = 'F' work?
                            
                                Weighted mse custom loss function in keras
                            
                                Training broke with ResourceExausted error
                            
                                Saving high-resolution images with plotnine
                            
                                Save pandas dataframe with numpy arrays column

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fit mixture of Gaussians with fixed covariance in Python

Tags:

python

machine-learning

scikit-learn

gmm

Ulf Aslak

People also ask

1 Answers

David Dale

Recent Activity

Donate For Us