Recently I found an interesting article about a regression clustering algorithm that can handle both regression and clustering tasks:
http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Regression_Clustering.pdf
I'm just curious: are there any techniques (libraries) to do this in Python? Thanks!
Spath's algorithm is not implemented in Python, as far as I know.
But you could replicate its results using Gaussian mixture models in scikit-learn:
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
# generate random data
np.random.seed(1)
n = 10
x1 = np.random.uniform(0, 20, size=n)
x2 = np.random.uniform(0, 20, size=n)
y1 = x1 + np.random.normal(size=n)
y2 = 15 - x2 + np.random.normal(size=n)
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
data = np.vstack([x, y]).T
model = GaussianMixture(n_components=2).fit(data)
plt.scatter(x, y, c=model.predict(data))
plt.show()
This code produces a picture similar to the one in the paper:

The GMM is different from the Spath algorithm: the former maximizes the likelihood of ALL the data (X and y jointly), while the latter maximizes only the R^2 of y given X. In my opinion, for most practical problems you would prefer the GMM.
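For illustration, each fitted Gaussian component implies a regression line of its own: with a full 2x2 covariance matrix, the conditional mean E[y|x] = mu_y + (cov_xy / cov_xx) * (x - mu_x) is linear in x. A minimal sketch that reuses the model fitted above:
# recover the regression line implied by each Gaussian component
for k in range(model.n_components):
    mu_x, mu_y = model.means_[k]
    cov = model.covariances_[k]  # 2x2 covariance of (x, y)
    slope = cov[0, 1] / cov[0, 0]
    intercept = mu_y - slope * mu_x
    print('component %d: y = %.2f + %.2f * x' % (k, intercept, slope))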
If you still want the Spath algorithm, it could be implemented with a class like this, which runs a version of the EM algorithm:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.base import RegressorMixin, BaseEstimator, clone
class ClusteredRegressor(RegressorMixin, BaseEstimator):
    def __init__(self, n_components=2, base=Ridge(), random_state=1, max_iter=100, tol=1e-10, verbose=False):
        self.n_components = n_components
        self.base = base
        self.random_state = random_state
        self.max_iter = max_iter
        self.tol = tol
        self.verbose = verbose
    def fit(self, X, y):
        np.random.seed(self.random_state)
        self.estimators_ = [clone(self.base) for i in range(self.n_components)]
        # initialize cluster responsibilities randomly
        self.resp_ = np.random.uniform(size=(X.shape[0], self.n_components))
        self.resp_ /= self.resp_.sum(axis=1, keepdims=True)
        for it in range(self.max_iter):
            old_resp = self.resp_.copy()
            # Estimate sample-weighted regressions, one per cluster
            errors = np.empty(shape=self.resp_.shape)
            for i, est in enumerate(self.estimators_):
                est.fit(X, y, sample_weight=self.resp_[:, i])
                errors[:, i] = y - est.predict(X)
            self.mse_ = np.sum(self.resp_ * errors**2) / X.shape[0]  # responsibility-weighted MSE
            if self.verbose:
                print(self.mse_)
            # Recalculate responsibilities
            self.resp_ = np.exp(-errors**2 / self.mse_)
            self.resp_ /= self.resp_.sum(axis=1, keepdims=True)
            # stop if the change in responsibilities is small
            delta = np.abs(self.resp_ - old_resp).mean()
            if delta < self.tol:
                break
        self.n_iter_ = it
        return self
    def predict(self, X):
        """ Calculate a matrix of conditional predictions """
        return np.vstack([est.predict(X) for est in self.estimators_]).T
    def predict_proba(self, X, y):
        """ Estimate cluster probabilities of labeled data """
        # residuals of y against each cluster's regression; shape (n_samples, n_components)
        errors = y[:, np.newaxis] - self.predict(X)
        resp_ = np.exp(-errors**2 / self.mse_)
        resp_ /= resp_.sum(axis=1, keepdims=True)
        return resp_
This code is similar to the Spath algorithm, with the only difference that it uses soft "responsibilities" of each cluster for each observation instead of hard cluster assignments (this makes it easier to optimize). You can see that the resulting cluster assignment is similar to that of the GMM:
model = ClusteredRegressor()
model.fit(x[:, np.newaxis], y)
labels = np.argmax(model.resp_, axis=1)
plt.scatter(x, y, c=labels)
plt.show()
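For comparison, a k-means-style hard-assignment variant in the spirit of Spath's clusterwise regression can be sketched in a few lines (the paper's exact exchange procedure differs in its details; this is only a rough illustration with an iteration cap and no other safeguards): assign each point to the regressor with the smallest squared residual, refit each regressor on its points, and repeat until the assignment stops changing.
est = [Ridge() for _ in range(2)]
X = x[:, np.newaxis]
hard_labels = np.random.randint(2, size=len(x))  # random initial hard assignment
for _ in range(100):
    for k in range(2):
        if np.any(hard_labels == k):  # skip refitting if a cluster went empty
            est[k].fit(X[hard_labels == k], y[hard_labels == k])
    # reassign each point to the regressor with the smallest squared residual
    errors = np.vstack([(y - e.predict(X)) ** 2 for e in est]).T
    new_labels = np.argmin(errors, axis=1)
    if np.all(new_labels == hard_labels):
        break
    hard_labels = new_labels
plt.scatter(x, y, c=hard_labels)
plt.show()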

Unfortunately, the ClusteredRegressor model above cannot be applied to predict test data, because its output depends on the data labels (y). However, if you modify the code further, you could predict cluster probabilities conditional on X; in that case, the model would be useful for prediction.
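One possible modification along those lines (my own sketch, not something from the paper): train a classifier on X against the inferred cluster labels, then use it to estimate cluster probabilities for unlabeled points and blend the per-cluster predictions.
from sklearn.linear_model import LogisticRegression
# hypothetical extension: learn p(cluster | X) from the fitted responsibilities
clf = LogisticRegression().fit(x[:, np.newaxis], np.argmax(model.resp_, axis=1))
x_new = np.linspace(0, 20, 50)[:, np.newaxis]
proba = clf.predict_proba(x_new)  # cluster probabilities from X alone
y_cond = model.predict(x_new)  # per-cluster conditional predictions
y_hat = np.sum(proba * y_cond, axis=1)  # probability-weighted prediction
Note that in this toy example the two clusters overlap heavily in X, so the classifier has little to go on; the scheme works better when the clusters are separated in feature space.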