How can I find the mean distance from the centroid to all the data points in each cluster? I am already able to find the Euclidean distance of each point in my dataset from the centroid of each cluster. Now I want the mean of those distances per cluster. What is a good way of calculating the mean distance from each centroid? So far I have done this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def k_means(self):
    data = pd.read_csv('hdl_gps_APPLE_20111220_130416.csv', delimiter=',')
    combined_data = data.iloc[0:, 0:4].dropna()
    array_convt = combined_data.values
    combined_data.head()
    t_data = PCA(n_components=2).fit_transform(array_convt)
    k_means = KMeans()
    k_means.fit(t_data)
    # k-means fit_predict, for testing purposes
    clusters = k_means.fit_predict(t_data)
    cluster_0 = np.where(clusters == 0)
    print(cluster_0)
    X_cluster_0 = t_data[cluster_0]
    # distance of the first point in cluster 0 to its centroid
    distance = euclidean(X_cluster_0[0], k_means.cluster_centers_[0])
    print(distance)
    classified_data = k_means.labels_
    x_min = t_data[:, 0].min() - 5
    x_max = t_data[:, 0].max() - 1
    df_processed = data.copy()
    df_processed['Cluster Class'] = pd.Series(classified_data, index=df_processed.index)
    y_min, y_max = t_data[:, 1].min(), t_data[:, 1].max() + 5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 1), np.arange(y_min, y_max, 1))
    Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(1)
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Paired,
               aspect='auto', origin='lower')
    plt.plot(t_data[:, 0], t_data[:, 1], 'k.', markersize=20)
    centroids = k_means.cluster_centers_
    inert = k_means.inertia_
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='w', zorder=8)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()
In short, I want to calculate the mean distance of all data points in a particular cluster from that cluster's centroid, as I need to clean my data on the basis of this mean distance.
Choosing K: One method of choosing the value of K is the elbow method. In this method we run K-Means clustering for a range of K values, say K = 1 to 10, and calculate the sum of squared errors (SSE). The SSE is the sum of the squared distances between each data point and its cluster centroid.
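As a minimal sketch of the elbow method with scikit-learn (the synthetic blobs below are made up for illustration; scikit-learn's inertia_ attribute is exactly the SSE described above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three well-separated blobs, for illustration only
rng = np.random.RandomState(0)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                  for c in [(0, 0), (5, 5), (0, 5)]])

# inertia_ is the SSE: sum of squared distances of samples
# to their closest cluster center
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    sse.append(km.inertia_)

# SSE always decreases as K grows; the "elbow" is where the drop levels off
print(sse)
```

Plotting `sse` against K and looking for the bend in the curve gives the elbow.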
The centroid distance between clusters A and B is simply the distance between centroid(A) and centroid(B). The average distance is found by averaging the pairwise distances between the points of the two clusters.
In average-linkage clustering, the distance between two clusters is defined as the average of the distances between all pairs of objects, where each pair consists of one object from each cluster: D(r,s) = Trs / (Nr * Ns), where Trs is the sum of all pairwise distances between cluster r and cluster s, and Nr and Ns are the numbers of points in each cluster.
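Both quantities can be sketched in a few lines of numpy/scipy (the two toy clusters are made up; `average_linkage` is a hypothetical helper name implementing the D(r,s) formula above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def average_linkage(r, s):
    """D(r, s) = Trs / (Nr * Ns): mean of all pairwise distances
    between points of cluster r and points of cluster s."""
    pairwise = cdist(r, s)                      # Nr x Ns Euclidean distances
    return pairwise.sum() / (len(r) * len(s))   # Trs / (Nr * Ns)

# Toy clusters
cluster_r = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_s = np.array([[3.0, 0.0], [3.0, 1.0]])

# Centroid distance: distance between centroid(r) and centroid(s)
centroid_dist = np.linalg.norm(cluster_r.mean(axis=0) - cluster_s.mean(axis=0))

print(average_linkage(cluster_r, cluster_s))  # average pairwise distance
print(centroid_dist)                          # prints 3.0 for these points
```

Note the two measures generally differ: the average pairwise distance is at least as large as the centroid distance here.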
Here's one way. You can substitute another distance measure in k_mean_distance() if you want a metric other than Euclidean: calculate the distance between each data point assigned to a cluster and that cluster's center, and return the mean value.
Function for distance calculation:
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    # Euclidean distance of each data point assigned to this centroid
    distances = [np.sqrt((x - cx)**2 + (y - cy)**2)
                 for (x, y) in data[cluster_labels == i_centroid]]
    # return the mean value
    return np.mean(distances)
And for each centroid, use the function to get the mean distance:
total_distance = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(data, cx, cy, i, cluster_labels)
    total_distance.append(mean_distance)
So, in the context of your question:
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx)**2 + (y - cy)**2)
                 for (x, y) in data[cluster_labels == i_centroid]]
    return np.mean(distances)

t_data = PCA(n_components=2).fit_transform(array_convt)
k_means = KMeans()
clusters = k_means.fit_predict(t_data)
centroids = k_means.cluster_centers_
c_mean_distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(t_data, cx, cy, i, clusters)
    c_mean_distances.append(mean_distance)
If you plot the results with plt.plot(c_mean_distances), you will see the mean intra-cluster distance for each cluster.
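The same per-cluster mean can also be computed without unpacking x/y coordinates; a numpy-only sketch using broadcasting and np.linalg.norm, which works for any number of dimensions (the toy points, labels, and centroids below are made up, and `mean_distances_per_cluster` is a hypothetical helper name):

```python
import numpy as np

def mean_distances_per_cluster(points, labels, centroids):
    """Mean Euclidean distance from each centroid to the points
    assigned to it, via broadcasting instead of a comprehension."""
    return np.array([
        np.linalg.norm(points[labels == i] - c, axis=1).mean()
        for i, c in enumerate(centroids)
    ])

# Toy example: two obvious clusters, each point exactly 1.0 from its centroid
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

print(mean_distances_per_cluster(points, labels, centroids))  # → [1. 1.]
```

This avoids hard-coding two dimensions, so the same function works on data that was not reduced to 2-D by PCA.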
You can use the following attribute of KMeans:
cluster_centers_ : array, [n_clusters, n_features]
For every point, find which cluster it belongs to using predict(X) (it returns the centroid index), and then calculate the distance to the centroid that predict returns.
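A minimal sketch of that suggestion, assuming a fitted KMeans model and using synthetic blobs for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: two blobs, for illustration only
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in [(0, 0), (5, 5)]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# predict returns the index of the nearest centroid for each point
labels = km.predict(X)

# distance of every point to its own centroid, via fancy indexing
dists = np.linalg.norm(X - km.cluster_centers_[labels], axis=1)

# mean distance per cluster
for i in range(km.n_clusters):
    print(i, dists[labels == i].mean())
```

Indexing cluster_centers_ with the label array pairs each point with its own centroid in one vectorized step, so no explicit per-point loop is needed.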