Recently I've found interesting article about regression clustering algorithm which can deal both tasks of regression and clustering:
http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Regression_Clustering.pdf
I'm just curios-is there some technics (libraries) to do it via Python? Thanks!
The algorithm of Spath is not implemented in Python, as far as I know.
But you could replicate its results using Gaussian mixture models in scikit-learn:
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
# generate random data
np.random.seed(1)
n = 10
x1 = np.random.uniform(0, 20, size=n)
x2 = np.random.uniform(0, 20, size=n)
y1 = x1 + np.random.normal(size=n)
y2 = 15 - x2 + np.random.normal(size=n)
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
data = np.vstack([x, y]).T
model = GaussianMixture (n_components=2).fit(data)
plt.scatter(x, y, c=model.predict(data))
plt.show()
This code produces the picture, similar to one in the paper:
The GMM is different from Spath algorithm, because the former tries to maximize prediction accuracy of ALL data (X and y), and the latter maximizes only R^2 of y. In my opinion, for most practical problems you would prefer the GMM.
If you still want the Spath algorithm, it could be done with a class like this, implementing a version of EM algorithm:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.base import RegressorMixin, BaseEstimator, clone
class ClusteredRegressor(RegressorMixin, BaseEstimator):
def __init__(self, n_components=2, base=Ridge(), random_state=1, max_iter=100, tol=1e-10, verbose=False):
self.n_components = n_components
self.base = base
self.random_state = random_state
self.max_iter = max_iter
self.tol = tol
self.verbose = verbose
def fit(self, X, y):
np.random.seed(self.random_state)
self.estimators_ = [clone(self.base) for i in range(self.n_components)]
# initialize cluster responsibilities randomly
self.resp_ = np.random.uniform(size=(X.shape[0], self.n_components))
self.resp_ /= self.resp_.sum(axis=1, keepdims=True)
for it in range(self.max_iter):
old_resp = self.resp_.copy()
# Estimate sample-weithted regressions
errors = np.empty(shape=self.resp_.shape)
for i, est in enumerate(self.estimators_):
est.fit(X, y, sample_weight=self.resp_[:, i])
errors[:, i] = y - est.predict(X)
self.mse_ = np.sum(self.resp_ * errors**2) / X.shape[0]
if self.verbose:
print(self.mse_)
# Recalculate responsibilities
self.resp_ = np.exp(-errors**2 / self.mse_)
self.resp_ /= self.resp_.sum(axis=1, keepdims=True)
# stop if change in responsibilites is small
delta = np.abs(self.resp_ - old_resp).mean()
if delta < self.tol:
break
self.n_iter_ = it
return self
def predict(self, X):
""" Calculate a matrix of conditional predictions """
return np.vstack([est.predict(X) for est in self.estimators_]).T
def predict_proba(self, X, y):
""" Estimate cluster probabilities of labeled data """
predictions = self.predict(X)
errors = np.empty(shape=self.resp_.shape)
for i, est in enumerate(self.estimators_):
errors[:, i] = y - est.predict(X)
resp_ = np.exp(-errors**2 / self.mse_)
resp_ /= resp_.sum(axis=1, keepdims=True)
return resp_
This code is similar to Spath algorithm, with the only difference that it uses soft "responsibilities" of each cluster for each observation, instead of hard cluster assignment (this way, it is easier for optimization). You can see that the resulting cluster assignment is similar to GMM:
model = ClusteredRegressor()
model.fit(x[:, np.newaxis], y)
labels = np.argmax(model.resp_, axis=1)
plt.scatter(x, y, c=labels)
plt.show()
Unfortunately, this model cannot be applied to predict test data, because its output depends on data labels (y
). However, if you further modify my code, you could predict cluster probability conditional on X
. In this case, the model would be useful for prediction.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With