k-fold Cross Validation for determining k in k-means?

Tags: python, machine-learning, numpy, statistics, nlp

In a document clustering process, as a data pre-processing step, I first applied singular value decomposition to obtain U, S and Vt. Then, by choosing a suitable number of singular values, I truncated Vt, which now gives me a good document-document correlation from what I read here. Now I am clustering the columns of the matrix Vt to group similar documents together. I chose k-means for this, and the initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a bit deeper into choosing the k value itself. To determine the number of clusters k in k-means, it was suggested that I look at cross-validation.

Before implementing it, I wanted to figure out if there is a built-in way to achieve it using numpy or scipy. Currently, I perform k-means by simply using the function from scipy.

import numpy
from scipy.linalg import svd
from scipy.cluster.vq import whiten, kmeans2

# Preprocess the data and compute the SVD.
# A is the TF-IDF representation of the original term-document matrix.
U, S, Vt = svd(A)

# Obtain the document-document correlations from Vt.
# The cutoff of 50 was chosen after examining a scree plot of S.
# Transposing gives one row per document, as kmeans2 expects.
docvectors = numpy.transpose(Vt[0:50, :])

# Scale each feature to unit variance, then run k-means with k = 10
whitened = whiten(docvectors)
res, idx = kmeans2(whitened, 10, iter=20)

Assuming my methodology is correct so far (please correct me if I am missing some step), what is the standard way of using the output at this stage to perform cross-validation? Any references, implementations, or suggestions on how this would be applied to k-means would be greatly appreciated.

622

asked Jul 08 '11 19:07

Legend

1 Answer

To run k-fold cross validation, you'd need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F₁, or a specialized one such as the V-measure.
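As a minimal sketch of what such a measure looks like, assuming scikit-learn is installed, the snippet below scores two hypothetical cluster assignments against hand-made ground-truth labels. The label lists are invented purely for illustration. It shows that the V-measure ignores how clusters are numbered and drops below 1 when two true classes are merged into one cluster.

```python
from sklearn.metrics import (
    homogeneity_completeness_v_measure,
    v_measure_score,
)

# Hypothetical ground-truth classes for six documents.
truth = [0, 0, 1, 1, 2, 2]

# Same grouping as the truth, but with different cluster ids.
same_up_to_renaming = [2, 2, 0, 0, 1, 1]

# Classes 0 and 1 merged into a single cluster.
merged = [0, 0, 0, 0, 1, 1]

# Cluster ids carry no meaning, so a pure renaming scores perfectly.
perfect = v_measure_score(truth, same_up_to_renaming)

# The V-measure is the harmonic mean of homogeneity and completeness.
h, c, v = homogeneity_completeness_v_measure(truth, merged)

print("renamed clusters:", perfect)
print("merged clusters: homogeneity=%.3f completeness=%.3f v=%.3f" % (h, c, v))
```

Merging keeps completeness at 1 (each class still sits in a single cluster) but lowers homogeneity, which is what pulls the V-measure down.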

Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work; the difference with classification is that you only need part of your data to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids and thus the clusters.
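The partial-labeling idea can be sketched as follows, again assuming scikit-learn. The data is synthetic (`make_blobs` stands in for the document vectors), and the choice of 60 labeled points out of 300 is an arbitrary assumption for the example. k-means is fitted on every point, while the score is computed only on the subset that has labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Synthetic stand-in for the document vectors: three well-separated groups.
X, y_true = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Pretend that only 60 of the 300 points were labeled by hand.
rng = np.random.RandomState(0)
labeled = rng.choice(len(X), size=60, replace=False)

# k-means sees every point, labeled or not.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# The quality measure is evaluated on the labeled subset only.
score = v_measure_score(y_true[labeled], km.labels_[labeled])
print("V-measure on the labeled subset:", score)
```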

V-measure and several other scores are implemented in scikit-learn, as well as generic cross validation code and a "grid search" module that optimizes according to a specified measure of evaluation using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
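One way these pieces might fit together is sketched below, assuming a recent scikit-learn in which grid search lives in `sklearn.model_selection` and `"v_measure_score"` is accepted as a scoring name. The data is synthetic and fully labeled, and the candidate values of k are arbitrary. For each k, `GridSearchCV` fits k-means on the training folds, assigns the held-out points to the nearest centroid, and scores that assignment against the held-out labels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV

# Synthetic labeled data with four well-separated groups.
X, y = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Try several values of k and score each with 5-fold cross-validation.
search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3, 4, 5, 6]},
    scoring="v_measure_score",
    cv=5,
)

# y is ignored when fitting k-means; it is only used for scoring.
search.fit(X, y)

best_k = search.best_params_["n_clusters"]
print("best k:", best_k)
print("mean V-measure for best k:", search.best_score_)
```

On real document vectors the groups are rarely this cleanly separated, so the scores for neighboring values of k may be close, and it is worth inspecting `search.cv_results_` rather than relying on the single best value.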

160

answered Sep 22 '22 11:09

Fred Foo
