I'm working on an anomaly detection task using KMeans.
The pandas DataFrame I'm using has a single feature and looks like this:
df = array([[12534.],
[12014.],
[12158.],
[11935.],
...,
[ 5120.],
[ 4828.],
[ 4443.]])
I'm able to fit and to predict values with the following instructions:
km = KMeans(n_clusters=2)
km.fit(df)
km.predict(df)
In order to identify anomalies, I would like to calculate the distance between each point and its centroid, but with a single-feature DataFrame I'm not sure this is the correct approach.
I found examples that use Euclidean distance. One of them is the following:
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2) for (x, y) in data[cluster_labels == i_centroid]]
    return distances

centroids = self.km.cluster_centers_
distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(day_df, cx, cy, i, clusters)
    distances.append({'x': cx, 'y': cy, 'distance': mean_distance})
This code doesn't work for me because, since I have a single-feature DataFrame, my centroids look like this:
array([[11899.90692187],
[ 5406.54143126]])
In this case, what is the correct approach to find the distance between the centroids and the points? Is it possible?
Thank you, and sorry for the trivial question; I'm still learning.
There's scipy.spatial.distance_matrix, which you can make use of:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# setup a set of 2d points
np.random.seed(2)
df = np.random.uniform(0,1,(100,2))
# make it a dataframe
df = pd.DataFrame(df)
# clustering with 3 clusters
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(df)
preds = km.predict(df)
# get centroids
centroids = km.cluster_centers_
# visualize
plt.scatter(df[0], df[1], c=preds)
plt.scatter(centroids[:,0], centroids[:,1], c=range(centroids.shape[0]), s=1000)
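gives a scatter plot of the points colored by predicted cluster, with the three centroids drawn as large markers.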
Now the distance matrix:
from scipy.spatial import distance_matrix
dist_mat = pd.DataFrame(distance_matrix(df.values, centroids))
You can confirm that this is correct by
dist_mat.idxmin(axis=1) == preds
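As an aside, scikit-learn's KMeans also has a transform method that returns the distance from each sample to every cluster centre, so it should agree with the distance matrix above. A minimal sketch, assuming the same kind of random 2D data as in the setup (the variable names X, dist and manual, and the n_init and random_state values, are illustrative choices and not part of the original code):

```python
import numpy as np
from sklearn.cluster import KMeans

# illustrative data: 100 random 2d points, similar to the setup above
rng = np.random.RandomState(2)
X = rng.uniform(0, 1, (100, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# transform() returns an (n_samples, n_clusters) array of Euclidean
# distances from each point to each centroid
dist = km.transform(X)

# the same distances computed by hand with broadcasting
manual = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)

same_distances = np.allclose(dist, manual)
print(dist.shape, same_distances)
```

This also works unchanged when X has a single column, which is the situation in the question.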
And finally, the mean distance to centroids:
dist_mat.groupby(preds).mean()
gives:
          0         1         2
0  0.243367  0.525194  0.571674
1  0.525350  0.228947  0.575169
2  0.560297  0.573860  0.197556
where the columns denote the centroid number, the rows denote the predicted cluster, and each entry is the mean distance from the points in that cluster to that centroid.
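For the single-feature data in the question, the same idea should carry over: in one dimension the Euclidean distance reduces to the absolute difference between a value and a centroid. A minimal sketch using only NumPy, assuming the handful of values and the two centroids that are visible in the question (the elided rows are left out, and the names X, dist, labels and nearest are illustrative):

```python
import numpy as np

# values and centroids copied from the question (elided rows omitted)
X = np.array([[12534.], [12014.], [12158.], [11935.],
              [5120.], [4828.], [4443.]])
centroids = np.array([[11899.90692187], [5406.54143126]])

# with a single feature, Euclidean distance is just |x - c|;
# broadcasting (n, 1) against (1, k) gives an (n, k) distance matrix
dist = np.abs(X - centroids.T)

# index of the nearest centroid for each point, and the distance to it
labels = dist.argmin(axis=1)
nearest = dist[np.arange(len(X)), labels]

print(labels)
print(nearest)
```

Points with an unusually large value in nearest would be the candidates for anomalies; what counts as unusually large is a choice that depends on your data.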