sklearn decomposition top terms

Is there a way I can determine the top features/terms for each cluster when the data has been dimensionality-reduced (decomposed)?

In the example from the scikit-learn documentation, the top terms are extracted by sorting the feature weights and matching them against the vectorizer's feature_names; both have the same number of features.

http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html

I would like to know how to implement get_top_terms_per_cluster():

X = vectorizer.fit_transform(dataset)  # with m features
X = lsa.fit_transform(X)  # reduce number of features to m'
k_means.fit(X)
get_top_terms_per_cluster()  # out of m features
Ofer Helman asked Feb 12 '23

1 Answer

Assuming lsa = TruncatedSVD(n_components=k) for some k, the obvious way to get term weights makes use of the fact that LSA/SVD is a linear transformation, i.e., each row of lsa.components_ is a weighted sum of the input terms, and you can multiply that with the cluster centroids from k-means.
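The shape bookkeeping behind that multiplication can be sketched with toy arrays (the sizes here are made up, not taken from the 20newsgroups run below):

```python
import numpy as np

# Toy sizes: m = 6 original terms, m' = 2 LSA components, 3 clusters.
rng = np.random.default_rng(0)
components = rng.normal(size=(2, 6))   # stands in for lsa.components_, shape (m', m)
centroids = rng.normal(size=(3, 2))    # stands in for km.cluster_centers_, shape (k, m')

# Composing the two linear maps expresses each centroid over the original m terms.
term_weights = centroids @ components
print(term_weights.shape)  # (3, 6): one weight per term, per cluster
```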

Let's set some things up and train some models:

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.cluster import KMeans
>>> from sklearn.decomposition import TruncatedSVD
>>> import numpy as np
>>> data = fetch_20newsgroups()
>>> vectorizer = TfidfVectorizer(min_df=3, max_df=.95, stop_words='english')
>>> lsa = TruncatedSVD(n_components=10)
>>> km = KMeans(n_clusters=3)
>>> X = vectorizer.fit_transform(data.data)
>>> X_lsa = lsa.fit_transform(X)
>>> km.fit(X_lsa)

Now multiply the LSA components and the k-means centroids:

>>> X.shape
(11314, 38865)
>>> lsa.components_.shape
(10, 38865)
>>> km.cluster_centers_.shape
(3, 10)
>>> weights = np.dot(km.cluster_centers_, lsa.components_)
>>> weights.shape
(3, 38865)

Then print; we need absolute values for the weights because of the sign indeterminacy in LSA:

>>> features = vectorizer.get_feature_names()
>>> weights = np.abs(weights)
>>> for i in range(km.n_clusters):
...     top5 = np.argsort(weights[i])[-5:]
...     print(list(zip([features[j] for j in top5], weights[i, top5])))
...     
[(u'escrow', 0.042965734662740895), (u'chip', 0.07227072329320372), (u'encryption', 0.074855609122467345), (u'clipper', 0.075661844826553887), (u'key', 0.095064798549230306)]
[(u'posting', 0.012893125486957332), (u'article', 0.013105911161236845), (u'university', 0.0131617377000081), (u'com', 0.023016036009601809), (u'edu', 0.034532489348082958)]
[(u'don', 0.02087448155525683), (u'com', 0.024327099321009758), (u'people', 0.033365757270264217), (u'edu', 0.036318114826463417), (u'god', 0.042203130080860719)]

Mind you, you really need a stop word filter for this to work. The stop words tend to end up in every single component, and get a high weight in every cluster centroid.

Fred Foo answered Feb 14 '23