I'm following this tutorial from Scikit learn on text clustering using K-Means: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
In the example, optionally LSA (using SVD) is used to perform dimensionality reduction.
Why is this useful? The number of dimensions (features) can already be controlled in the TF-IDF vectorizer via the "max_features" parameter.
I understand that LSA (and LDA) are also topic modelling techniques. The difference from clustering is that documents belong to multiple topics, but to only one cluster. I do not understand why LSA would be used in the context of K-Means clustering.
Example code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

# Placeholder corpus; the parameters below assume a real corpus with many documents.
documents = ["some text", "some other text", "more text"]

tfidf_vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000, min_df=2, stop_words='english', use_idf=True)
X = tfidf_vectorizer.fit_transform(documents)

# LSA: project the TF-IDF vectors onto 1000 latent dimensions, then re-normalize to unit length.
svd = TruncatedSVD(1000)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
Xnew = lsa.fit_transform(X)

model = KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1, verbose=False)
model.fit(Xnew)
LSA transforms the bag-of-words feature space into a new feature space (with an orthonormal set of basis vectors) where each dimension represents a latent concept, expressed as a linear combination of words from the original space.
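A minimal sketch of that view, assuming the fitted svd and tfidf_vectorizer objects from the snippet above: each row of svd.components_ holds the word weights of one latent concept, so printing the highest-weighted words per row shows the "linear combination of words" directly.

import numpy as np

# get_feature_names_out() on scikit-learn >= 1.0; get_feature_names() on older versions
terms = np.asarray(tfidf_vectorizer.get_feature_names_out())
for i, weights in enumerate(svd.components_[:3]):  # first three latent concepts
    top = np.argsort(weights)[::-1][:5]            # five strongest words per concept
    print(f"Concept {i}:", list(zip(terms[top], weights[top].round(3))))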
As with PCA, a few top eigenvectors generally capture most of the variance in the transformed feature space, while the remaining eigenvectors mainly represent the noise in the dataset. The top eigenvectors in the LSA feature space can therefore be thought of as likely to capture most of the concepts defined by the words in the original space.
Hence, reducing the dimension in the transformed LSA feature space is likely to be much more effective than in the original bag-of-words TF-IDF feature space (which, with max_features, simply chops off the less frequent / unimportant words), thereby leading to better-quality data after the dimensionality reduction and likely improving the quality of the clusters.
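To make "a few top eigenvectors capture most of the variance" concrete, here is a sketch (again assuming the fitted svd from above) that checks the cumulative explained variance and how many components would suffice to retain, say, 90% of it:

import numpy as np

cumulative = np.cumsum(svd.explained_variance_ratio_)
print(f"Variance kept by all {len(cumulative)} components: {cumulative[-1]:.1%}")
# index of the first component where the running total crosses 90%
n_for_90 = int(np.searchsorted(cumulative, 0.90)) + 1
print(f"Components needed for 90% of the variance: {n_for_90}")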
Additionally, dimensionality reduction helps to fight the curse of dimensionality, e.g., the problem that arises in the distance computations at the heart of k-means.
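A self-contained illustration of that curse (not part of the tutorial): for random points, the spread of pairwise Euclidean distances shrinks relative to their mean as dimensionality grows, so "near" and "far" neighbours become hard to tell apart, which is exactly what hurts distance-based methods like k-means.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 100, 10000):
    points = rng.random((200, d))  # 200 random points in d dimensions
    dists = pdist(points)          # all pairwise Euclidean distances
    print(f"d={d:>5}: std/mean of distances = {dists.std() / dists.mean():.3f}")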