How can I find the mean distance from the centroid to all the data points in each cluster? I am already able to find the Euclidean distance of each point in my dataset from the centroid of each cluster. Now I want the mean of those distances per cluster. What is a good way of calculating the mean distance from each centroid? So far I have done this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def k_means(self):
    data = pd.read_csv('hdl_gps_APPLE_20111220_130416.csv', delimiter=',')
    combined_data = data.iloc[0:, 0:4].dropna()
    array_convt = combined_data.values
    combined_data.head()
    t_data = PCA(n_components=2).fit_transform(array_convt)
    k_means = KMeans()
    k_means.fit(t_data)
    # k-means fit_predict, for testing purposes
    clusters = k_means.fit_predict(t_data)
    cluster_0 = np.where(clusters == 0)
    print(cluster_0)
    X_cluster_0 = t_data[cluster_0]
    # distance of the first point in cluster 0 to its centroid
    distance = euclidean(X_cluster_0[0], k_means.cluster_centers_[0])
    print(distance)
    classified_data = k_means.labels_
    x_min = t_data[:, 0].min() - 5
    x_max = t_data[:, 0].max() - 1
    df_processed = data.copy()
    df_processed['Cluster Class'] = pd.Series(classified_data, index=df_processed.index)
    y_min, y_max = t_data[:, 1].min(), t_data[:, 1].max() + 5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 1), np.arange(y_min, y_max, 1))
    Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(1)
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Paired,
               aspect='auto', origin='lower')
    plt.plot(t_data[:, 0], t_data[:, 1], 'k.', markersize=20)
    centroids = k_means.cluster_centers_
    inert = k_means.inertia_
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='w', zorder=8)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()
In short, I want to calculate the mean distance of all data points in a particular cluster from that cluster's centroid, as I need to clean my data on the basis of this mean distance.
Choosing K: One method of choosing the value of K is the elbow method. In this method we run K-Means clustering for a range of K values, say K = 1 to 10, and calculate the sum of squared errors (SSE). The SSE is the sum of the squared distances between each data point and its cluster centroid.
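As a minimal sketch of the elbow method with scikit-learn (the synthetic blobs below are made up for illustration; scikit-learn's inertia_ attribute is exactly the SSE described above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three well-separated blobs, for illustration only
rng = np.random.RandomState(0)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                  for c in [(0, 0), (5, 5), (0, 5)]])

# inertia_ is the SSE: sum of squared distances of samples
# to their closest cluster center
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    sse.append(km.inertia_)

# SSE always decreases as K grows; the "elbow" is where the drop levels off
print(sse)
```

Plotting `sse` against K and looking for the bend in the curve gives the elbow.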
The centroid distance between clusters A and B is simply the distance between centroid(A) and centroid(B). The average distance is found by averaging the pairwise distances between the points of the two clusters.
In average-linkage clustering, the distance between two clusters is defined as the average of the distances between all pairs of objects, where each pair consists of one object from each cluster: D(r,s) = Trs / (Nr * Ns), where Trs is the sum of all pairwise distances between cluster r and cluster s, and Nr and Ns are the numbers of points in each cluster.
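Both quantities can be sketched in a few lines of numpy/scipy (the two toy clusters are made up; `average_linkage` is a hypothetical helper name implementing the D(r,s) formula above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def average_linkage(r, s):
    """D(r, s) = Trs / (Nr * Ns): mean of all pairwise distances
    between points of cluster r and points of cluster s."""
    pairwise = cdist(r, s)                      # Nr x Ns Euclidean distances
    return pairwise.sum() / (len(r) * len(s))   # Trs / (Nr * Ns)

# Toy clusters
cluster_r = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_s = np.array([[3.0, 0.0], [3.0, 1.0]])

# Centroid distance: distance between centroid(r) and centroid(s)
centroid_dist = np.linalg.norm(cluster_r.mean(axis=0) - cluster_s.mean(axis=0))

print(average_linkage(cluster_r, cluster_s))  # average pairwise distance
print(centroid_dist)                          # prints 3.0 for these points
```

Note the two measures generally differ: the average pairwise distance is at least as large as the centroid distance here.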
Here's one way. You can substitute another distance measure in k_mean_distance() if you want a metric other than Euclidean: calculate the distance between each data point assigned to a cluster and that cluster's center, and return the mean value.
Function for distance calculation:
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    # Euclidean distance of each data point assigned to this centroid
    distances = [np.sqrt((x - cx)**2 + (y - cy)**2)
                 for (x, y) in data[cluster_labels == i_centroid]]
    # return the mean value
    return np.mean(distances)
And for each centroid, use the function to get the mean distance:
total_distance = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(data, cx, cy, i, cluster_labels)
    total_distance.append(mean_distance)
So, in the context of your question:
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx)**2 + (y - cy)**2)
                 for (x, y) in data[cluster_labels == i_centroid]]
    return np.mean(distances)

t_data = PCA(n_components=2).fit_transform(array_convt)
k_means = KMeans()
clusters = k_means.fit_predict(t_data)
centroids = k_means.cluster_centers_
c_mean_distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(t_data, cx, cy, i, clusters)
    c_mean_distances.append(mean_distance)
If you plot the results with plt.plot(c_mean_distances), you will see the mean intra-cluster distance for each cluster.
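The same per-cluster mean can also be computed without unpacking x/y coordinates; a numpy-only sketch using broadcasting and np.linalg.norm, which works for any number of dimensions (the toy points, labels, and centroids below are made up, and `mean_distances_per_cluster` is a hypothetical helper name):

```python
import numpy as np

def mean_distances_per_cluster(points, labels, centroids):
    """Mean Euclidean distance from each centroid to the points
    assigned to it, via broadcasting instead of a comprehension."""
    return np.array([
        np.linalg.norm(points[labels == i] - c, axis=1).mean()
        for i, c in enumerate(centroids)
    ])

# Toy example: two obvious clusters, each point exactly 1.0 from its centroid
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

print(mean_distances_per_cluster(points, labels, centroids))  # → [1. 1.]
```

This avoids hard-coding two dimensions, so the same function works on data that was not reduced to 2-D by PCA.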
You can use the following attribute of KMeans:
cluster_centers_ : array, [n_clusters, n_features]
For every point, find which cluster it belongs to using predict(X) (it returns the centroid index), and then calculate the distance to the centroid that predict returns.
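A minimal sketch of that suggestion, assuming a fitted KMeans model and using synthetic blobs for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: two blobs, for illustration only
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in [(0, 0), (5, 5)]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# predict returns the index of the nearest centroid for each point
labels = km.predict(X)

# distance of every point to its own centroid, via fancy indexing
dists = np.linalg.norm(X - km.cluster_centers_[labels], axis=1)

# mean distance per cluster
for i in range(km.n_clusters):
    print(i, dists[labels == i].mean())
```

Indexing cluster_centers_ with the label array pairs each point with its own centroid in one vectorized step, so no explicit per-point loop is needed.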