Find distance between centroid and points in a single feature dataframe - KMeans

Question

I'm working on an anomaly detection task using KMeans.
The pandas DataFrame I'm using has a single feature and looks like this:

df = array([[12534.],
           [12014.],
           [12158.],
           [11935.],
           ...,
           [ 5120.],
           [ 4828.],
           [ 4443.]])

I can fit the model and predict cluster labels with the following instructions:

km = KMeans(n_clusters=2)
km.fit(df)
km.predict(df)

To identify anomalies, I would like to calculate the distance between each point and its centroid, but with a single-feature dataframe I'm not sure this is the correct approach.

I found examples that use the Euclidean distance. One of them is the following:

def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2) for (x, y) in data[cluster_labels == i_centroid]]
    return distances

centroids = self.km.cluster_centers_
distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(day_df, cx, cy, i, clusters)
    distances.append({'x': cx, 'y': cy, 'distance': mean_distance})

This code doesn't work for me because it unpacks each centroid into two coordinates (cx, cy), while my dataframe has a single feature, so my centroids look like this:

array([[11899.90692187],
       [ 5406.54143126]])

In this case, what is the correct approach to find the distance between each centroid and the points? Is it possible?

Thank you, and sorry for the trivial question; I'm still learning.

Quang Hoang · Accepted Answer

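You can use scipy.spatial.distance_matrix, which works for any number of features, including one:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# set up a set of 2d points
np.random.seed(2)
df = np.random.uniform(0,1,(100,2))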

# make it a dataframe
df = pd.DataFrame(df)

# clustering with 3 clusters
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(df)
preds = km.predict(df)

# get centroids
centroids = km.cluster_centers_

# visualize
plt.scatter(df[0], df[1], c=preds)
plt.scatter(centroids[:,0], centroids[:,1], c=range(centroids.shape[0]), s=1000)

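gives a scatter plot of the points colored by predicted cluster, with the three centroids drawn as large markers (image not included).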

Now the distance matrix:

from scipy.spatial import distance_matrix

dist_mat = pd.DataFrame(distance_matrix(df.values, centroids))
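
For readers who want to see what the distance matrix contains, here is a minimal NumPy-only sketch of the same computation. With its default p=2, distance_matrix returns pairwise Euclidean distances, which can be reproduced by broadcasting. The arrays `points` and `centers` below are hypothetical stand-ins for df.values and centroids, not the data from this answer.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 1, (5, 2))   # hypothetical: 5 points, 2 features
centers = rng.uniform(0, 1, (3, 2))  # hypothetical: 3 centroids, 2 features

# Broadcast to a (5, 3, 2) array of point-minus-centroid differences,
# then take the Euclidean norm over the feature axis.
diffs = points[:, None, :] - centers[None, :, :]
dist = np.linalg.norm(diffs, axis=2)  # shape (5, 3): rows are points, columns are centroids

print(dist.shape)
```

Nothing here depends on the number of features, so the same lines work when the last dimension is 1.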

You can confirm that this is correct by checking that each point's nearest centroid matches its predicted label:

(dist_mat.idxmin(axis=1) == preds).all()

And finally, the mean distance to centroids:

dist_mat.groupby(preds).mean()

gives:

          0         1         2
0  0.243367  0.525194  0.571674
1  0.525350  0.228947  0.575169
2  0.560297  0.573860  0.197556

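where each row is a cluster (the points assigned to it) and each column is a centroid, so each entry is the mean distance from that cluster's points to that centroid. The diagonal holds the mean distance of each cluster's points to their own centroid.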
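
To connect this back to the single-feature case in the question: with one feature, the Euclidean distance reduces to an absolute difference. The sketch below is illustrative only. It uses just the rows visible in the question as `X`, copies the centroid values from the question's output, and the 0.95 quantile cutoff is a hypothetical choice, not something recommended by the original posts.

```python
import numpy as np

# Only the rows shown in the question, shaped (n_samples, 1).
X = np.array([[12534.], [12014.], [12158.], [11935.],
              [5120.], [4828.], [4443.]])

# Centroids copied from the question; with a fitted model
# these would come from km.cluster_centers_.
centroids = np.array([[11899.90692187], [5406.54143126]])

# (n_samples, n_clusters) distances; in 1D this is |x - c|.
dist_to_all = np.abs(X - centroids.T)

# Nearest centroid per point, then the distance to that centroid.
labels = dist_to_all.argmin(axis=1)
dist_to_own = dist_to_all[np.arange(len(X)), labels]

# Hypothetical anomaly rule: flag points beyond the 0.95 quantile of distances.
threshold = np.quantile(dist_to_own, 0.95)
is_anomaly = dist_to_own > threshold

print(labels)
print(dist_to_own.round(2))
print(X[is_anomaly].ravel())
```

With a fitted scikit-learn model, km.transform(df) should return the same kind of point-to-centroid distance array directly, so the manual subtraction is mainly useful for seeing what is being computed.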

Tags: python, python-3.x, pandas, machine-learning, k-means

Giordano
