
Find distance between centroid and points in a single feature dataframe - KMeans

I'm working on an anomaly detection task using KMeans.
The pandas DataFrame I'm using has a single feature and looks like this:

df = array([[12534.],
           [12014.],
           [12158.],
           [11935.],
           ...,
           [ 5120.],
           [ 4828.],
           [ 4443.]])

I can fit the model and predict cluster labels with the following instructions:

km = KMeans(n_clusters=2)
km.fit(df)
km.predict(df)

To identify anomalies, I would like to calculate the distance between each point and its centroid, but with a single-feature DataFrame I'm not sure this is the correct approach.

I found examples that use the Euclidean distance for this. One of them is the following:

def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2) for (x, y) in data[cluster_labels == i_centroid]]
    return distances

centroids = self.km.cluster_centers_
distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(day_df, cx, cy, i, clusters)
    distances.append({'x': cx, 'y': cy, 'distance': mean_distance})

This code doesn't work for me because it unpacks each centroid into two coordinates (cx, cy), whereas my centroids have only one coordinate each, since my DataFrame has a single feature:

array([[11899.90692187],
       [ 5406.54143126]])

In this case, what is the correct way to find the distance between each centroid and the points? Is it possible at all?

Thank you, and sorry for the trivial question; I'm still learning.

Giordano asked Oct 16 '25 18:10


1 Answer

You can use scipy.spatial.distance_matrix for this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# set up a set of 2D points
np.random.seed(2)
df = np.random.uniform(0,1,(100,2))

# make it a dataframe
df = pd.DataFrame(df)

# clustering with 3 clusters
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(df)
preds = km.predict(df)

# get centroids
centroids = km.cluster_centers_

# visualize
plt.scatter(df[0], df[1], c=preds)
plt.scatter(centroids[:,0], centroids[:,1], c=range(centroids.shape[0]), s=1000)

This gives a scatter plot of the points colored by predicted cluster, with the three centroids drawn as large markers.

Now the distance matrix:

from scipy.spatial import distance_matrix

dist_mat = pd.DataFrame(distance_matrix(df.values, centroids))

You can confirm that this is correct by checking that, for every point, the nearest centroid matches the predicted label:

dist_mat.idxmin(axis=1) == preds
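
As a further cross-check, the fitted estimator can produce the same matrix itself: KMeans.transform returns the Euclidean distance from each sample to each cluster center. The sketch below rebuilds a small 2D example so that it runs on its own; the data, the seed, and the variable names are assumptions made for illustration, not part of the setup above.

```python
import numpy as np
from scipy.spatial import distance_matrix
from sklearn.cluster import KMeans

# Illustrative data: 100 random 2D points (assumed, not the original df)
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (100, 2))

# n_init and random_state are fixed only to make the run repeatable
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distances from every point to every centroid, computed two ways
d_scipy = distance_matrix(X, km.cluster_centers_)   # shape (100, 3)
d_sklearn = km.transform(X)                         # shape (100, 3)

# Both matrices should agree up to floating-point error
matrices_agree = np.allclose(d_scipy, d_sklearn, atol=1e-6)
print("matrices agree:", matrices_agree)

# The nearest centroid per row should coincide with the assigned label
print("labels agree:", np.array_equal(d_scipy.argmin(axis=1), km.labels_))
```

If both checks pass, either matrix can be used for the steps that follow.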

And finally, the mean distance to centroids:

dist_mat.groupby(preds).mean()

gives:

          0         1         2
0  0.243367  0.525194  0.571674
1  0.525350  0.228947  0.575169
2  0.560297  0.573860  0.197556

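where each row corresponds to a cluster (the points assigned to it) and each column to a centroid, so entry (i, j) is the mean distance from the points in cluster i to centroid j. The diagonal holds the mean distance of each cluster's points to their own centroid.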
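
The same idea carries over to the single-feature case in the question, because one-dimensional Euclidean distance is just the absolute difference. Below is a minimal sketch, assuming the data is kept as a 2D array of shape (n_samples, 1). The array X holds only the seven values visible in the question (the elided ones are not reconstructed), and names such as dist_to_own are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Only the values visible in the question; shape (n_samples, 1)
X = np.array([[12534.], [12014.], [12158.], [11935.],
              [5120.], [4828.], [4443.]])

# n_init and random_state are fixed only to make the run repeatable
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_                  # cluster index for each point
centroids = km.cluster_centers_      # shape (2, 1)

# centroids[labels] selects, for each point, the centroid it belongs to;
# with a single feature the Euclidean distance reduces to |x - c|
dist_to_own = np.abs(X - centroids[labels]).ravel()

for value, label, dist in zip(X.ravel(), labels, dist_to_own):
    print(f"value={value:8.1f}  cluster={label}  distance={dist:7.2f}")
```

Points with an unusually large distance to their own centroid are natural candidates to inspect as anomalies. How large counts as "unusual" is a separate choice that depends on the data.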

Quang Hoang answered Oct 18 '25 07:10


