 

Why use LSA before K-Means when doing text clustering

I'm following this scikit-learn tutorial on text clustering using K-Means: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

In the example, LSA (via truncated SVD) is optionally used to perform dimensionality reduction.

Why is this useful? The number of dimensions (features) can already be controlled in the tf-idf vectorizer via the max_features parameter.

I understand that LSA (like LDA) is also a topic-modelling technique. The difference from clustering is that a document can belong to multiple topics, but only to one cluster. What I don't understand is why LSA would be used in the context of K-Means clustering.

Example code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

documents = ["some text", "some other text", "more text"]  # placeholder corpus

# Bag-of-words tf-idf features, capped at 10000 dimensions
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000, min_df=2, stop_words='english', use_idf=True)
X = tfidf_vectorizer.fit_transform(documents)

# LSA: project onto 1000 latent dimensions (n_components must be smaller
# than the number of tf-idf features), then re-normalize each row so that
# k-means' Euclidean distance behaves like cosine similarity
svd = TruncatedSVD(n_components=1000)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
Xnew = lsa.fit_transform(X)

model = KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1, verbose=False)
model.fit(Xnew)
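
After fitting, model.labels_ holds the cluster index assigned to each document, and model.cluster_centers_ holds the centroids in the reduced LSA space.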
asked Feb 22 '17 by Niko Nelissen

1 Answer

LSA transforms the bag-of-words feature space into a new feature space (with an orthonormal set of basis vectors) in which each dimension represents a latent concept, expressed as a linear combination of the words in the original space.
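
A minimal sketch of what those latent dimensions look like, reusing the fitted tfidf_vectorizer and svd from the question's snippet (and assuming scikit-learn >= 1.0 for get_feature_names_out; older versions use get_feature_names): each row of svd.components_ weights the original words, so printing the largest weights shows the word combination behind each concept.

import numpy as np

terms = tfidf_vectorizer.get_feature_names_out()     # one vocabulary entry per tf-idf column
for i, component in enumerate(svd.components_[:5]):  # inspect the first 5 latent concepts
    top = np.argsort(component)[::-1][:8]            # indices of the 8 largest weights
    print(f"concept {i}: " + ", ".join(terms[j] for j in top))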

As with PCA, a few top eigenvectors (the leading singular vectors of the SVD) generally capture most of the variance in the transformed feature space, while the remaining ones mostly represent noise in the dataset. The top eigenvectors in the LSA feature space can therefore be expected to capture most of the concepts defined by the words in the original space.

Hence, reducing dimensionality in the transformed LSA feature space is likely to be much more effective than in the original bag-of-words tf-idf space (where max_features simply chops off the less frequent, supposedly unimportant words). This yields better-quality data after the reduction and is likely to improve the quality of the clusters.
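
One way to see this, again reusing the fitted svd from above: TruncatedSVD exposes explained_variance_ratio_, which tells you how much of the tf-idf variance a given number of LSA dimensions retains, something a plain max_features cutoff cannot report.

import numpy as np

print(f"variance retained by all components: {svd.explained_variance_ratio_.sum():.2%}")
cumulative = np.cumsum(svd.explained_variance_ratio_)
# smallest component count reaching ~90% cumulative variance
# (if 90% is never reached, more components than were fitted would be needed)
k90 = int(np.searchsorted(cumulative, 0.90)) + 1
print(f"components needed for ~90% of the variance: {k90}")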

Additionally, dimensionality reduction helps to fight the curse of dimensionality, which arises, for example, in the distance computations that k-means performs.
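
A self-contained illustration of that effect (toy data, not the author's code): as the dimension grows, the nearest and farthest neighbors of a random point become nearly equidistant, which is exactly what degrades the Euclidean distances k-means relies on.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10000):
    points = rng.random((1000, d))                          # 1000 random points in d dimensions
    dists = np.linalg.norm(points - points[0], axis=1)[1:]  # distances from the first point
    contrast = (dists.max() - dists.min()) / dists.min()    # relative spread; shrinks as d grows
    print(f"d={d:>5}: (max-min)/min = {contrast:.3f}")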

answered Nov 14 '22 by Sandipan Dey