Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

libraries for regression clustering in python?

Recently I've found interesting article about regression clustering algorithm which can deal both tasks of regression and clustering:

http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Regression_Clustering.pdf

I'm just curios-is there some technics (libraries) to do it via Python? Thanks!

like image 375
Keithx Avatar asked Aug 29 '16 14:08

Keithx


Video Answer


1 Answers

The algorithm of Spath is not implemented in Python, as far as I know.

But you could replicate its results using Gaussian mixture models in scikit-learn:

import numpy as np
from sklearn.mixture import GaussianMixture 
import matplotlib.pyplot as plt
# generate random data
np.random.seed(1)
n = 10
x1 = np.random.uniform(0, 20, size=n)
x2 = np.random.uniform(0, 20, size=n)
y1 = x1 + np.random.normal(size=n)
y2 = 15 - x2 + np.random.normal(size=n)
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
data = np.vstack([x, y]).T
model = GaussianMixture (n_components=2).fit(data)
plt.scatter(x, y, c=model.predict(data))
plt.show()

This code produces the picture, similar to one in the paper:

enter image description here

The GMM is different from Spath algorithm, because the former tries to maximize prediction accuracy of ALL data (X and y), and the latter maximizes only R^2 of y. In my opinion, for most practical problems you would prefer the GMM.

If you still want the Spath algorithm, it could be done with a class like this, implementing a version of EM algorithm:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.base import RegressorMixin, BaseEstimator, clone

class ClusteredRegressor(RegressorMixin, BaseEstimator):
    def __init__(self, n_components=2, base=Ridge(), random_state=1, max_iter=100, tol=1e-10, verbose=False):
        self.n_components = n_components
        self.base = base
        self.random_state = random_state
        self.max_iter = max_iter
        self.tol = tol
        self.verbose = verbose

    def fit(self, X, y):
        np.random.seed(self.random_state)
        self.estimators_ = [clone(self.base) for i in range(self.n_components)]
        # initialize cluster responsibilities randomly
        self.resp_ = np.random.uniform(size=(X.shape[0], self.n_components))
        self.resp_ /= self.resp_.sum(axis=1, keepdims=True)
        for it in range(self.max_iter):
            old_resp = self.resp_.copy()
            # Estimate sample-weithted regressions
            errors = np.empty(shape=self.resp_.shape)
            for i, est in enumerate(self.estimators_):
                est.fit(X, y, sample_weight=self.resp_[:, i])
                errors[:, i] = y - est.predict(X)
            self.mse_ = np.sum(self.resp_ * errors**2) / X.shape[0]
            if self.verbose:
                print(self.mse_)
            # Recalculate responsibilities
            self.resp_ = np.exp(-errors**2 / self.mse_)
            self.resp_ /= self.resp_.sum(axis=1, keepdims=True)
            # stop if change in responsibilites is small
            delta = np.abs(self.resp_ - old_resp).mean()
            if delta < self.tol:
                break
        self.n_iter_ = it
        return self

    def predict(self, X):
        """ Calculate a matrix of conditional predictions """
        return np.vstack([est.predict(X) for est in self.estimators_]).T

    def predict_proba(self, X, y):
        """ Estimate cluster probabilities of labeled data """
        predictions = self.predict(X)
        errors = np.empty(shape=self.resp_.shape)
        for i, est in enumerate(self.estimators_):
            errors[:, i] = y - est.predict(X)
        resp_ = np.exp(-errors**2 / self.mse_)
        resp_ /= resp_.sum(axis=1, keepdims=True)
        return resp_

This code is similar to Spath algorithm, with the only difference that it uses soft "responsibilities" of each cluster for each observation, instead of hard cluster assignment (this way, it is easier for optimization). You can see that the resulting cluster assignment is similar to GMM:

model = ClusteredRegressor()
model.fit(x[:, np.newaxis], y)
labels = np.argmax(model.resp_, axis=1)
plt.scatter(x, y, c=labels)
plt.show()

enter image description here

Unfortunately, this model cannot be applied to predict test data, because its output depends on data labels (y). However, if you further modify my code, you could predict cluster probability conditional on X. In this case, the model would be useful for prediction.

like image 149
David Dale Avatar answered Oct 03 '22 05:10

David Dale