Sklearn: Mean Distance from Centroid of each cluster

How can I find the mean distance from the centroid to all the data points in each cluster? I am able to find the Euclidean distance of each point in my dataset from the centroid of each cluster. Now I want the mean of those distances for each cluster. What is a good way of calculating the mean distance from each centroid? So far I have done this:

# Imports assumed for this snippet
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def k_means(self):
    data = pd.read_csv('hdl_gps_APPLE_20111220_130416.csv', delimiter=',')
    combined_data = data.iloc[0:, 0:4].dropna()
    #print combined_data
    array_convt = combined_data.values
    #print array_convt
    combined_data.head()


    t_data=PCA(n_components=2).fit_transform(array_convt)
    #print t_data
    k_means=KMeans()
    k_means.fit(t_data)
    #------------k means fit predict method for testing purpose-----------------
    clusters=k_means.fit_predict(t_data)
    #print clusters.shape
    cluster_0 = np.where(clusters == 0)  # indices of the points assigned to cluster 0
    print(cluster_0)

    X_cluster_0 = t_data[cluster_0]
    #print X_cluster_0


    # Euclidean distance from the first point of cluster 0 to its centroid
    distance = euclidean(X_cluster_0[0], k_means.cluster_centers_[0])
    print(distance)


    classified_data = k_means.labels_
    #print ('all rows forst column........')
    x_min = t_data[:, 0].min() - 5
    x_max = t_data[:, 0].max() - 1
    #print ('min is ')
    #print x_min
    #print ('max is ')
    #print x_max

    df_processed = data.copy()
    df_processed['Cluster Class'] = pd.Series(classified_data, index=df_processed.index)
    #print df_processed

    y_min, y_max = t_data[:, 1].min(), t_data[:, 1].max() + 5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 1), np.arange(y_min, y_max, 1))

    #print ('the mesh grid is: ')

    #print xx
    Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(1)
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Paired,
               aspect='auto', origin='lower')


    #print Z


    plt.plot(t_data[:, 0], t_data[:, 1], 'k.', markersize=20)
    centroids = k_means.cluster_centers_
    inert = k_means.inertia_
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='w', zorder=8)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()

In short I want to calculate mean distance of all the data points in particular cluster from the centroid of that cluster as I need to clean my data on the basis of this mean distance

asked Nov 27 '16 by Rezwan


People also ask

How do you find the distance from the centroid in k-means clustering?

One method of choosing the value of K is the elbow method. In this method we run K-Means clustering for a range of K values (say K = 1 to 10) and calculate the Sum of Squared Errors (SSE) for each. SSE is the sum of squared distances between the data points and their cluster centroids; the K at which the curve bends (the "elbow") is a reasonable choice.
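For illustration, a minimal sketch of the elbow method with scikit-learn (X is a made-up name for your (n_samples, n_features) data array):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X: your data array (assumed to exist)
sse = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    sse.append(km.inertia_)  # inertia_ = sum of squared distances to the closest centroid

plt.plot(k_values, sse, 'o-')
plt.xlabel('K')
plt.ylabel('SSE')
plt.show()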

How do you find the distance using the centroid?

The centroid distance between cluster A and cluster B is simply the distance between centroid(A) and centroid(B). The average distance is calculated as the average pairwise distance between the points of one cluster and the points of the other.
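A quick illustrative sketch (not from the original page) of the centroid distance between two clusters A and B given as NumPy arrays:

import numpy as np

# A and B are hypothetical (n_points, n_features) arrays, one per cluster
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 3.0], [5.0, 3.0]])

# Centroid distance: distance between the two cluster means
centroid_dist = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
print(centroid_dist)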

How do you calculate cluster distance?

In average linkage clustering, the distance between two clusters is defined as the average of the distances between all pairs of objects, where each pair is made up of one object from each cluster: D(r,s) = T_rs / (N_r * N_s), where T_rs is the sum of all pairwise distances between cluster r and cluster s, and N_r and N_s are the numbers of objects in clusters r and s.
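As a sketch of that formula (again with made-up arrays A and B standing in for clusters r and s), the average-linkage distance can be computed with scipy's cdist:

import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])    # cluster r (hypothetical)
B = np.array([[4.0, 3.0], [5.0, 3.0]])    # cluster s (hypothetical)

pairwise = cdist(A, B)                     # all pairwise distances between r and s
D_rs = pairwise.sum() / (len(A) * len(B))  # T_rs / (N_r * N_s), i.e. the mean pairwise distance
print(D_rs)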


2 Answers

Here's one way. You can substitute another distance measure in the k_mean_distance() function if you want a metric other than Euclidean.

Calculate the distance between the data points assigned to each cluster and that cluster's center, and return the mean value.

Function for distance calculation:

def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    # Calculate Euclidean distance for each data point assigned to centroid 
    distances = [np.sqrt((x-cx)**2+(y-cy)**2) for (x, y) in data[cluster_labels == i_centroid]]
    # return the mean value
    return np.mean(distances)

And for each centroid, use the function to get the mean distance:

total_distance = []
for i, (cx, cy) in enumerate(centroids):
    # Function from above
    mean_distance = k_mean_distance(data, cx, cy, i, cluster_labels)
    total_distance.append(mean_distance)

So, in the context of your question:

def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x-cx)**2 + (y-cy)**2) for (x, y) in data[cluster_labels == i_centroid]]
    return np.mean(distances)

t_data = PCA(n_components=2).fit_transform(array_convt)
k_means = KMeans()
clusters = k_means.fit_predict(t_data)
centroids = k_means.cluster_centers_

c_mean_distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(t_data, cx, cy, i, clusters)
    c_mean_distances.append(mean_distance)

If you plot the results with plt.plot(c_mean_distances) you should see something like this:

[Plot: k-means cluster index vs. mean distance to centroid]
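As a side note (not part of the original answer): k_mean_distance() above assumes 2-D points (x, y) from the PCA step. If you keep more than two components, a dimension-agnostic variant using np.linalg.norm could look like this:

# Hypothetical n-dimensional variant of k_mean_distance
def k_mean_distance_nd(data, centroid, i_centroid, cluster_labels):
    # Points assigned to this centroid; distances computed over all features
    points = data[cluster_labels == i_centroid]
    return np.linalg.norm(points - centroid, axis=1).mean()

c_mean_distances = [k_mean_distance_nd(t_data, c, i, clusters)
                    for i, c in enumerate(k_means.cluster_centers_)]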

answered Oct 08 '22 by alphaleonis


You can use the following attribute of KMeans:

cluster_centers_ : array, [n_clusters, n_features]

For every point, determine which cluster it belongs to using predict(X), and then calculate the distance to the centroid of that cluster (predict returns the cluster index). A minimal sketch of this approach is below.
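Here km and X are hypothetical names for a fitted KMeans model and your data array:

import numpy as np

labels = km.predict(X)                        # cluster index for each point
centers = km.cluster_centers_[labels]         # centroid assigned to each point
dists = np.linalg.norm(X - centers, axis=1)   # distance of each point to its centroid

# Mean distance per cluster
mean_per_cluster = [dists[labels == k].mean() for k in range(km.n_clusters)]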

answered Oct 09 '22 by Farseer