 

Understanding "score" returned by scikit-learn KMeans

I applied clustering to a set of text documents (about 100). I converted them to TF-IDF vectors using TfidfVectorizer and supplied the vectors as input to sklearn.cluster.KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10). Now when I run

model.fit(X)
print model.score(X)

on my vectors, I get a value very close to zero if all the text documents are very similar, and a large negative value if the documents are very different.

It serves my basic purpose of finding which sets of documents are similar, but can someone help me understand what exactly this model.score() value signifies for a fit? How can I use this value to justify my findings?
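For reference, here is a minimal sketch of the setup described above (Python 3 syntax; the document list and variable names are illustrative placeholders, not the original data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Illustrative stand-in for the ~100 text documents
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
    "investors worried about the markets",
]

# Convert the documents to TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Cluster into two groups and print the score
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10)
model.fit(X)
print(model.score(X))  # negative of the K-means objective on X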

asked Sep 03 '15 by Prateek Dewan


People also ask

How do you interpret the results of k-means clustering?

K-means minimizes the within-cluster sum of squares, i.e. the sum of squared distances between each point and its cluster centroid. When the value of k is 1, the within-cluster sum of squares is at its highest; as the value of k increases, the within-cluster sum of squares decreases.
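As a rough illustration of that behaviour (synthetic data, made up for this example), the within-cluster sum of squares is exposed by scikit-learn as inertia_ and shrinks as k grows:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 5)  # synthetic data, illustrative only

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # inertia_ decreases as k increases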

What does the Kmeans score mean?

Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares across all clusters. A good model is one with low inertia and a low number of clusters (k).
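That definition can be checked by hand; the sketch below (synthetic data, illustrative only) recomputes the sum of squared distances and compares it with inertia_:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
X = rng.rand(50, 3)  # synthetic data, illustrative only

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Sum of squared distances of each point to its assigned centroid
manual_inertia = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(manual_inertia, km.inertia_)  # the two values should agree closely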

What does Cluster_centers_ attribute results into?

In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book.
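A small sketch with made-up 2-D points showing how predict indexes into cluster_centers_:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)                    # the "code book": one row per cluster center
print(km.predict([[0.0, 0.1], [5.0, 5.1]]))   # indices of the closest centers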


2 Answers

In the documentation it says:

Returns:
    score : float
        Opposite of the value of X on the K-means objective.

To understand what that means you need to have a look at the k-means algorithm. What k-means essentially does is find cluster centers that minimize the sum of distances between data samples and their associated cluster centers.

It is a two-step process: (a) each data sample is assigned to its closest cluster center, and (b) the cluster centers are adjusted to lie at the center of all samples assigned to them. These steps are repeated until a stopping criterion (maximum iterations / minimum change between the last two iterations) is met.

As you can see there remains a distance between the data samples and their associated cluster centers, and the objective of our minimization is that distance (sum of all distances).
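For illustration, here is a bare-bones NumPy sketch of those two alternating steps; it is not how scikit-learn implements it (no k-means++ initialization, no convergence check, only a crude guard against empty clusters):

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # naive random initialization
    for _ in range(n_iter):
        # (a) assign each sample to its closest cluster center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # (b) move each center to the mean of the samples assigned to it
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    # objective: sum of (squared) distances between samples and their centers
    objective = ((X - centers[labels]) ** 2).sum()
    return centers, labels, objective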

You naturally get large distances if there is a lot of variety in the data samples and the number of data samples is significantly higher than the number of clusters, which in your case is only two. On the contrary, if all data samples were identical, you would always get a zero distance regardless of the number of clusters.

From the documentation I would expect that all values are negative, though. If you observe both negative and positive values, maybe there is more to the score than that.
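To make both points concrete on made-up data: identical samples give a score of zero, and more varied samples give a negative score (scikit-learn may warn that it found fewer distinct clusters than requested in the identical case):

import numpy as np
from sklearn.cluster import KMeans

identical = np.ones((10, 4))                    # every sample is the same point
varied = np.random.RandomState(0).rand(10, 4)   # spread-out samples

print(KMeans(n_clusters=2, n_init=10, random_state=0).fit(identical).score(identical))  # zero
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit(varied).score(varied))        # negative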

I wonder how you got the idea of clustering into two clusters though.

answered Sep 25 '22 by ypnos


The wording chosen by the documentation is a bit confusing. It says "Opposite of the value of X on the K-means objective." It means the negative of the K-means objective value.

K-Means Objective

The objective of K-means is to reduce the sum of squared distances of points from their respective cluster centroids. It goes by other names such as the J-squared error function, the J-score, or the within-cluster sum of squares. This value indicates how internally coherent the clusters are (the lower, the better).
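Written out, with c(i) denoting the cluster assigned to sample x_i and μ_j the centroid of cluster j, the objective is:

J = Σ_i || x_i − μ_{c(i)} ||²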

The value of the objective function can be obtained directly from the following attribute.

model.inertia_
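As a quick sanity check (synthetic data, illustrative only), on the training data score should simply be the negative of inertia_:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 5)  # synthetic data, illustrative only
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(model.inertia_)   # the K-means objective (within-cluster sum of squares)
print(model.score(X))   # same value with the opposite sign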

answered Sep 25 '22 by Tarun Kumar Yellapu