How to identify Cluster labels in kmeans scikit learn

I am learning scikit-learn in Python. The example linked below displays the top-occurring words in each cluster, not a cluster name.

http://scikit-learn.org/stable/auto_examples/document_clustering.html

I found that the km object has "km.labels_", which lists the centroid id of each sample, and that id is just a number.

I have two questions:

1. How do I generate the cluster labels?
2. How do I identify the members of the clusters for further processing?

I have a working knowledge of k-means and am aware of tf-idf concepts.

vij555 asked Feb 05 '15 13:02



1 Answer

  1. How do I generate the cluster labels?

I'm not sure what you mean by this. You have no cluster labels other than cluster 1, cluster 2, ..., cluster n. That is why it's called unsupervised learning, because there are no labels.
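If what is wanted is a human-readable name for each numbered cluster, one common workaround (not something k-means provides by itself) is to name each cluster after the highest-weighted terms of its centroid, which is roughly what the linked example prints. A minimal sketch, assuming a tiny made-up corpus; `docs`, `cluster_names`, and the choice of three terms per name are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration.
docs = [
    "the cat sat on the mat",
    "cats and dogs are pets",
    "my dog chased the cat",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
    "the market rallied and stock prices rose",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Recover the term for each column index from the fitted vocabulary
# (a dict mapping term -> column index).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# For each centroid, sort feature indices by descending weight.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

# Use the three heaviest terms of each centroid as a descriptive name.
cluster_names = {
    cluster_id: ", ".join(terms[i] for i in order_centroids[cluster_id, :3])
    for cluster_id in range(km.n_clusters)
}

for cluster_id, name in cluster_names.items():
    print("Cluster %d: %s" % (cluster_id, name))
```

These names are only a description of the centroid; the cluster identity itself is still the integer index.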

Do you mean you actually have labels and you want to see if the clustering algorithm happened to cluster the data according to your labels?

In that case, the documentation you linked to provides an example:

# `labels` holds the known ground-truth classes; `km` is the fitted KMeans model.
from sklearn import metrics

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
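
As a self-contained sketch of that comparison, assuming synthetic data from `make_blobs` stands in for a real data set with known classes (the three hand-picked centers and the variable names are illustrative assumptions, not taken from the linked example):

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated data with known ground-truth labels (assumed for illustration).
X, labels = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Compare the cluster assignments against the known labels.
homogeneity = metrics.homogeneity_score(labels, km.labels_)
completeness = metrics.completeness_score(labels, km.labels_)
v_measure = metrics.v_measure_score(labels, km.labels_)

print("Homogeneity: %0.3f" % homogeneity)
print("Completeness: %0.3f" % completeness)
print("V-measure: %0.3f" % v_measure)
```

These scores do not depend on how the clusters are numbered, so cluster 0 does not need to correspond to class 0.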

  2. How to identify the members of the clusters for further processing.

See the documentation for KMeans. In particular, the predict method:

predict(X)

Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
    New data to predict.

Returns:
labels : array, shape [n_samples,]
    Index of the cluster each sample belongs to.

If you don't want to predict something new, km.labels_ already holds the same information for the training data: entry i is the index of the cluster that training sample i was assigned to.
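
A minimal sketch of both routes, assuming a small made-up corpus vectorized with tf-idf; `docs`, `members`, and the new sentence passed to `predict` are hypothetical names and data chosen for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration.
docs = [
    "the cat sat on the mat",
    "cats and dogs are pets",
    "my dog chased the cat",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
    "the market rallied and stock prices rose",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# km.labels_[i] is the cluster index of docs[i], so the members of each
# cluster are the positions where labels_ equals that cluster's index.
members = {
    cluster_id: np.flatnonzero(km.labels_ == cluster_id)
    for cluster_id in range(km.n_clusters)
}

for cluster_id, indices in members.items():
    print("Cluster %d:" % cluster_id, [docs[i] for i in indices])

# For unseen data, reuse the fitted vectorizer and call predict.
new_labels = km.predict(vectorizer.transform(["the dog and the cat"]))
print("New document assigned to cluster", int(new_labels[0]))
```

The index arrays in `members` can be used to slice `X` or `docs` for whatever further processing is needed.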

IVlad answered Sep 21 '22 18:09