I applied clustering to a set of about 100 text documents. I converted them to TF-IDF vectors using TfidfVectorizer and supplied the vectors as input to sklearn.cluster.KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10). Now when I call model.fit() and then print model.score() on my vectors, I get a very small value if all the text documents are very similar, and a very large negative value if the documents are very different.
This serves my basic purpose of finding which sets of documents are similar, but can someone help me understand what exactly this model.score() value signifies for a fit? How can I use this value to justify my findings?
The within-cluster sum of squares adds up the squared distance from each point to the center of its cluster. When k is 1, the within-cluster sum of squares is at its highest. As k increases, the within-cluster sum of squares decreases.
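The trend described above can be checked with a small sketch. The synthetic blob data, the range of k, and the variable names below are assumptions made for illustration; they are not from the original question.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated 2-D blobs of 30 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0.0, 5.0, 10.0)])

inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)
    # inertia_ is the within-cluster sum of squares for this k
    print(k, round(model.inertia_, 2))
```

On data like this the printed values should drop sharply up to k=3 and only slowly afterwards, which is the usual "elbow" used to pick k.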
K-Means: Inertia
Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares over all data points. A good model is one with low inertia AND a low number of clusters (K).
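As a minimal sketch of that calculation in plain Python, assuming the centroids and the point-to-cluster labels are already known (the helper name inertia and the toy data are hypothetical):

```python
def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

# Toy data: two clusters of two points, each point at distance 1 from its centroid.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]

value = inertia(points, centroids, labels)
print(value)  # 4.0
```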
In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book. predict takes one parameter, X, an array-like or sparse matrix of shape (n_samples, n_features) holding the new data to predict.
In the documentation it says:
Returns: score : float Opposite of the value of X on the K-means objective.
To understand what that means you need to have a look at the k-means algorithm. What k-means essentially does is find cluster centers that minimize the sum of squared distances between data samples and their associated cluster centers.
It is a two-step process, where (a) each data sample is assigned to its closest cluster center, and (b) each cluster center is moved to the mean of all samples assigned to it. These steps are repeated until a stopping criterion (maximum number of iterations, or a minimum change between the last two iterations) is met.
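The two steps can be sketched in plain Python as follows. This is an illustrative toy, not scikit-learn's implementation: it assumes a simple random initialization instead of k-means++, and the function names and sample points are made up for the example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    # Step (a): index of the closest center for every point.
    return [min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points]

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: plain random init
    for _ in range(max_iter):
        labels = assign(points, centers)
        # Step (b): move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # no change between iterations: stop
            break
        centers = new_centers
    return centers, assign(points, centers)

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Which cluster gets index 0 depends on the random initialization, so only the grouping of the points is meaningful, not the label numbers.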
As you can see, some distance remains between the data samples and their associated cluster centers, and the objective of the minimization is the sum of all those squared distances.
You naturally get a large value if your data samples vary a lot, or if the number of data samples is significantly higher than the number of clusters, which in your case is only two. Conversely, if all data samples were identical, you would always get a distance of zero, regardless of the number of clusters.
From the documentation I would expect that all values are negative, though. If you observe both negative and positive values, maybe there is more to the score than that.
I wonder how you got the idea of clustering into two clusters though.
The wording chosen by the documentation is a bit confusing. It says "Opposite of the value of X on the K-means objective." It means the negative of the K-means objective.
K-Means Objective
The objective in K-means is to minimize the sum of squared distances of the points from their respective cluster centroids. It goes by other names, such as J-Squared error function, J-score, or within-cluster sum of squares. This value tells you how internally coherent the clusters are (the lower, the better).
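Written out, with symbols that are assumptions for this note (x_i for the n data points, \mu_j for the k centroids, J for the objective), the objective and the reported score relate as:

```latex
J = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^{2},
\qquad
\text{score}(X) = -J
```

Since J is a sum of squares it is never negative, so the score is never positive, and it is exactly zero only when every point coincides with its centroid.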
For the data the model was fitted on, the value of the objective function can be read directly from the following attribute.
model.inertia_
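A small sketch to check the relationship between score and inertia_ on TF-IDF vectors. The six example documents and random_state=0 are assumptions added for reproducibility; the KMeans parameters are the ones from the question.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: three documents about cats, three about markets.
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
    "markets and stocks dropped today",
    "investors sold stocks as markets fell",
]

X = TfidfVectorizer().fit_transform(docs)
model = KMeans(n_clusters=2, init="k-means++", max_iter=100,
               n_init=10, random_state=0).fit(X)

score = model.score(X)  # negative of the K-means objective on X
print(score, -model.inertia_)
```

The two printed numbers should agree up to floating-point error when score is called on the same data used for fitting. On new data, score(X) is the negative sum of squared distances from those new points to their closest fitted centers, so it no longer has to match inertia_.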