Find length of cluster (how many point associated with cluster) after KMeans clustering (scikit learn)

Tags:

I have done clustering using Kmeans using sklearn. While it has a method to print the centroids, I am finding it rather bizzare that scikit-learn doesn't have a method to find out the cluster length (or that I have not seen it so far). Is there a neat way to get the cluster-length of each cluster or many points associated with cluster? I currently have this rather cludgy code to do it where I am finding cluster of length one and need to add other point to this cluster by measuring the Euclidean distance between the points and have to update the labels

import numpy as np
from clustering.clusternew import Kmeans_clu
from evolution.generate import reproduction
from mapping.somnew import mapping, no_of_neurons, neuron_weights_init
from population_creation.population import pop_create
from New_SOL import newsol


data = genfromtxt('iris.csv', delimiter=',', skip_header=0, usecols=range(0, 4)) ##Read the input data
actual_label = genfromtxt('iris.csv', delimiter=',', dtype=str,skip_header=0, usecols=(4))
chromosome = int(input("Enter the number of chromosomes: "))  #Input the population size
max_gen = int(input("Enter the maximum number of generation: "))  #Input the maximum number of generation

for i in range(0, chromosome):
    cluster = 3#random.randint(2, max_cluster)  ##Randomly selects cluster number from 2 to root(poplation)
    K.insert(i, cluster)  ##Store the number of clusters in clu
    print('value of K is ',K)
    u, label,z1,A1= Kmeans_clu(cluster, data)
    #print("centers and labels : ", u, label)
    lab.insert(i, label)  ##Store the labels in lab
    center.insert(i, u) 
    new_center = pop_create(max_cluster, features, cluster, u)
    population.insert(i, new_center)
    print("VAlue of population in main\n" ,population)

newsol(max_gen,population,data)

For newsol method we pass the new population from the above method generated code and again doing K-Means on the population

def ClusterIndicesComp(clustNum, labels_array):   #list comprehension for accessing the features in iris data set 
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])

def newsol(max_gen,population,data):
    #print('VAlue of NewSol Population is',population)
    for i in range(max_gen):
        cluster1=5
        u,label,t,l=Kmeans_clu(cluster1, population)
        A1.insert(i,t)
        plab.insert(i,label)
        pcenter.insert(i,u)
        k2=Counter(l.labels_)  #Count number of elements in each cluster
        k1=[t for (t, v) in k2.items() if v == 1] #element whose length is one will be fetched 
        t1= np.array(k1) #Iterating through the cluster which have one point associated with them 
        for b in range(len(t1)):
            print("Value in NEW_SOL is of 1 length cluster\n",t1[b])
            plot1=data[ClusterIndicesComp(t1[b], l.labels_)]
            print("Values are in sol of plot1",plot1)
            for q in range(cluster1):
                plot2=data[ClusterIndicesComp(q, l.labels_)]
                print("VAlue of plot2 is for \n",q,plot2)
                for i in range(len(plot2)):#To get one element at a time from plot2
                    plotk=plot2[i]
                    if([t for (t, v) in k2.items() if v >2]):#checking if the cluster have more than 2 points than only the distance will be calculated 
                        S=np.linalg.norm(np.array(plot1) - np.array(plotk))
                        print("Distance between plot1 and plotk is",plot1,plotk,np.linalg.norm(np.array(plot1) - np.array(plotk)))#euclidian distance is calculated 
                    else:
                        print("NO distance between them\n")

Kmeans which I have done is

from sklearn.cluster import KMeans
import numpy as np 

def Kmeans_clu(K, data):

    kmeans = KMeans(n_clusters=K, init='random', max_iter=1, n_init=1).fit(data) ##Apply k-means clustering
    labels = kmeans.labels_
    clu_centres = kmeans.cluster_centers_
    z={i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)} #getting cluster for each label 

    return clu_centres, labels ,z,kmeans

216

asked Jun 07 '18 03:06

Shivam Sharma

1 Answers

For getting number of instances in each cluster may be you can try using Counter:

from collections import Counter, defaultdict
print(Counter(estimator.labels_))

Result:

Counter({0: 62, 1: 50, 2: 38})

where cluster 0 has 62 instances, cluster 1 has 50 instances, and cluster 2 has 38 instances

And may be to store index of instances of each clusters, you can use defaultdict:

clusters_indices = defaultdict(list)
for index, c  in enumerate(estimator.labels_):
    clusters_indices[c].append(index)

Now, to find indices of instances in cluster 0, calling:

print(clusters_indices[0])

Result:

[50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 
 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114, 119, 121, 123, 126, 127, 133, 138, 142, 146, 149]

187

answered Sep 19 '22 11:09

student

Related questions
                            
                                Why is my implementations of the log-loss (or cross-entropy) not producing the same results?
                            
                                Can't load URL: The domain of this URL isn't included in the app's domains. Django Facebook Auth
                            
                                PyOpenSSL - how can I get SAN(Subject Alternative Names) list
                            
                                How to crop the biggest object in image with python opencv?
                            
                                Last Record in Column, SQLALCHEMY
                            
                                Raise MigrationSchemaMissing("Unable to create the django_migrations table (%s)" % exc)
                            
                                How to get the same hash in Python3 and Mac / Linux terminal?
                            
                                Python: variable naming convention - file, path, filepath, file_path
                            
                                Flask Admin extend "with select"-dropdown menu with custom button
                            
                                Second derivative in Keras
                            
                                How to use parallel_interleave in TensorFlow
                            
                                ERROR:gpu_process_transport_factory.cc(1007)-Lost UI shared context : while initializing Chrome browser through ChromeDriver in Headless mode
                            
                                Convert epoch time to formatted date string in pandas dataframe
                            
                                Center a label inside a circle with matplotlib
                            
                                Pandas: Get per-year counts for Dateranges spanning multiple years
                            
                                What is the difference between md5sum output and Python hashlib output?
                            
                                I get NotImplementedError when trying to do a prepared statement with mysql python connector
                            
                                QGtkStyle could not resolve GTK
                            
                                How do I find the length of a dataframe in dask?
                            
                                Managing configurations for different environments

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Find length of cluster (how many point associated with cluster) after KMeans clustering (scikit learn)

Tags:

python

machine-learning

k-means

unsupervised-learning

scikit-learn

Shivam Sharma

People also ask

1 Answers

student

Recent Activity

Donate For Us