 

Understanding "score" returned by scikit-learn KMeans

I applied clustering to a set of text documents (about 100). I converted them to TF-IDF vectors using TfidfVectorizer and supplied the vectors as input to sklearn.cluster.KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10). Now when I run

model.fit(X)
print model.score(X)

on my vectors, I get a value very close to zero if all the text documents are very similar, and a large negative value if the documents are very different.

It serves my basic purpose of finding which sets of documents are similar, but can someone help me understand what exactly this model.score() value signifies for a fit? How can I use this value to justify my findings?
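For reference, here is a minimal sketch of the setup described above (Python 3 syntax; the document list and variable names are illustrative placeholders, not the original data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Illustrative stand-in for the ~100 text documents
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
    "investors worried about the markets",
]

# Convert the documents to TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Cluster into two groups and print the score
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10)
model.fit(X)
print(model.score(X))  # negative of the K-means objective on X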

asked Sep 03 '15 by Prateek Dewan


People also ask

How do you interpret the results of k-means clustering?

K-means minimizes the within-cluster sum of squares, i.e. the sum of squared distances between each point and its cluster centroid. When the value of k is 1, the within-cluster sum of squares is at its highest; as the value of k increases, the within-cluster sum of squares decreases.
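As a rough illustration of that behaviour (synthetic data, made up for this example), the within-cluster sum of squares is exposed by scikit-learn as inertia_ and shrinks as k grows:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 5)  # synthetic data, illustrative only

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # inertia_ decreases as k increases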

What does the Kmeans score mean?

Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares across all clusters. A good model is one with low inertia and a low number of clusters (k).
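That definition can be checked by hand; the sketch below (synthetic data, illustrative only) recomputes the sum of squared distances and compares it with inertia_:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
X = rng.rand(50, 3)  # synthetic data, illustrative only

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Sum of squared distances of each point to its assigned centroid
manual_inertia = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(manual_inertia, km.inertia_)  # the two values should agree closely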

What does Cluster_centers_ attribute results into?

In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book.
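A small sketch with made-up 2-D points showing how predict indexes into cluster_centers_:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)                    # the "code book": one row per cluster center
print(km.predict([[0.0, 0.1], [5.0, 5.1]]))   # indices of the closest centers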


2 Answers

In the documentation it says:

Returns:
    score : float
        Opposite of the value of X on the K-means objective.

To understand what that means you need to have a look at the k-means algorithm. What k-means essentially does is find cluster centers that minimize the sum of distances between data samples and their associated cluster centers.

It is a two-step process: (a) each data sample is assigned to its closest cluster center, and (b) the cluster centers are adjusted to lie at the center of all samples assigned to them. These steps are repeated until a stopping criterion (maximum iterations / minimum change between the last two iterations) is met.

As you can see there remains a distance between the data samples and their associated cluster centers, and the objective of our minimization is that distance (sum of all distances).
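For illustration, here is a bare-bones NumPy sketch of those two alternating steps; it is not how scikit-learn implements it (no k-means++ initialization, no convergence check, only a crude guard against empty clusters):

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # naive random initialization
    for _ in range(n_iter):
        # (a) assign each sample to its closest cluster center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # (b) move each center to the mean of the samples assigned to it
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    # objective: sum of (squared) distances between samples and their centers
    objective = ((X - centers[labels]) ** 2).sum()
    return centers, labels, objective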

You naturally get large distances if there is a lot of variety in the data samples and the number of data samples is significantly higher than the number of clusters, which in your case is only two. On the contrary, if all data samples were identical, you would always get a zero distance regardless of the number of clusters.

From the documentation I would expect that all values are negative, though. If you observe both negative and positive values, maybe there is more to the score than that.
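To make both points concrete on made-up data: identical samples give a score of zero, and more varied samples give a negative score (scikit-learn may warn that it found fewer distinct clusters than requested in the identical case):

import numpy as np
from sklearn.cluster import KMeans

identical = np.ones((10, 4))                    # every sample is the same point
varied = np.random.RandomState(0).rand(10, 4)   # spread-out samples

print(KMeans(n_clusters=2, n_init=10, random_state=0).fit(identical).score(identical))  # zero
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit(varied).score(varied))        # negative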

I wonder how you got the idea of clustering into two clusters though.

answered Sep 25 '22 by ypnos


The wording chosen by the documentation is a bit confusing. It says "Opposite of the value of X on the K-means objective." It means the negative of the K-means objective value.

K-Means Objective

The objective of K-means is to reduce the sum of squared distances of points from their respective cluster centroids. It goes by other names such as the J-squared error function, the J-score, or the within-cluster sum of squares. This value indicates how internally coherent the clusters are (the lower, the better).
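Written out, with c(i) denoting the cluster assigned to sample x_i and μ_j the centroid of cluster j, the objective is:

J = Σ_i || x_i − μ_{c(i)} ||²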

The value of the objective function can be obtained directly from the following attribute.

model.inertia_
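As a quick sanity check (synthetic data, illustrative only), on the training data score should simply be the negative of inertia_:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 5)  # synthetic data, illustrative only
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(model.inertia_)   # the K-means objective (within-cluster sum of squares)
print(model.score(X))   # same value with the opposite sign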

answered Sep 25 '22 by Tarun Kumar Yellapu