
Clustering conceptually similar documents together?

This is more of a conceptual question than an implementation question, and I am hoping someone can clarify. My goal is the following: given a set of documents, I want to cluster them so that documents belonging to the same cluster share the same "concept".

From what I understand, Latent Semantic Analysis lets me find a low-rank approximation of a term-document matrix, i.e. given a matrix X, it decomposes X into a product of three matrices, one of which is a diagonal matrix Σ:

X = U Σ V^T

Now, I would proceed by choosing a low-rank approximation, i.e. keeping only the top-k values from Σ, and then calculating X'. Once I have this matrix, I would apply some clustering algorithm, and the end result would be a set of clusters grouping documents with similar concepts. Is this the right way to apply clustering, i.e. calculating X' and then clustering on top of it, or is some other method normally followed?
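To make the procedure concrete, here is a minimal sketch of the steps described above, using NumPy. The toy term-document matrix, the choice k = 2, and the small `kmeans` helper are all made up for illustration and are not taken from any particular library; a real corpus would need proper term weighting and a more careful choice of k.

```python
import numpy as np

# Hypothetical toy term-document matrix X (rows = terms, columns = documents).
# Documents 0-2 share one vocabulary, documents 3-5 share another.
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 0, 0, 3, 1, 1],
    [0, 0, 0, 1, 3, 0],
    [0, 0, 0, 0, 1, 3],
], dtype=float)

k = 2  # number of singular values to keep (an assumption for this toy data)

# Thin SVD: X = U @ diag(s) @ Vt, with s sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# X': the rank-k approximation built from the top-k singular triplets.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


def kmeans(points, init_idx, n_iter=20):
    """Plain Lloyd's k-means on the rows of `points`, seeded from given rows."""
    centers = points[init_idx].copy()
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, n_centers).
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(len(centers)):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels


# Each document is one column of X', so cluster the rows of X'.T.
labels = kmeans(X_k.T, init_idx=[0, 3])
print(labels)
```

On this toy data the two vocabularies do not overlap, so documents 0-2 land in one cluster and documents 3-5 in the other.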

Also, in a somewhat related question of mine, I was told that the meaning of a neighbor is lost as the number of dimensions increases. In that case, what is the justification for clustering these high-dimensional data points from X'? I am guessing that clustering similar documents is a real-world requirement; if so, how does one go about addressing this?
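A point that may bear on this concern: X' has as many rows as X, but its columns all lie in the k-dimensional subspace spanned by the first k columns of U. Because those columns are orthonormal, the distance between two documents in X' equals the distance between the corresponding columns of Σ_k V_k^T, which has only k rows. The sketch below checks this numerically; the random matrix, its size, and k = 3 are assumptions chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical term-document matrix: 50 terms, 8 documents.
X = rng.poisson(1.0, size=(50, 8)).astype(float)
k = 3

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # X', shape (50, 8)
D_k = np.diag(s[:k]) @ Vt[:k, :]              # reduced coordinates, shape (3, 8)


def pairwise(cols):
    """Euclidean distances between all pairs of columns of `cols`."""
    diff = cols[:, :, None] - cols[:, None, :]
    return np.linalg.norm(diff, axis=0)


# Distances between documents agree in the 50-row and the 3-row representation.
print(np.allclose(pairwise(X_k), pairwise(D_k)))
```

So any distance-based clustering of the columns of X' is effectively operating on k-dimensional points, not on points with one dimension per term.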

Legend asked Jul 07 '11 19:07



1 Answer

For the first part of your question: no, you do not need to perform any further 'clustering'. Such a clustering is already available from your singular value decomposition. If this is still unclear, please study the Latent Semantic Analysis article you linked in more detail.
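One possible reading of this, offered as a sketch rather than as the only interpretation: the rows of V_k^T give each document's weight on each of the k latent concepts, so a document can be assigned to the concept on which its weight is largest in magnitude. The toy matrix and k = 2 below are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy term-document matrix (rows = terms, columns = documents).
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 0, 0, 3, 1, 1],
    [0, 0, 0, 1, 3, 0],
    [0, 0, 0, 0, 1, 3],
], dtype=float)

k = 2  # number of latent concepts to keep (assumed)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Column j holds document j's coordinates in the k-dimensional concept space.
doc_concepts = np.diag(s[:k]) @ Vt[:k, :]

# Assign each document to the concept with the largest absolute weight.
# The absolute value is used because the sign of a singular vector is arbitrary.
dominant = np.abs(doc_concepts).argmax(axis=0)
print(dominant)
```

Here documents 0-2 share one dominant concept and documents 3-5 share the other. This works cleanly only because the two groups use disjoint terms; with overlapping vocabularies, documents typically load on several concepts at once and the grouping is less clear-cut.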

For the second part: please work out the first part of your question first, and then restate the second part based on that.

eat answered Sep 20 '22 17:09
