sklearn decomposition top terms

Is there a way I can determine the top features/terms for each cluster when the data has been dimensionality-reduced (decomposed)?

In the example from the scikit-learn documentation, the top terms are extracted by sorting the feature weights and matching them against the vectorizer's feature_names; both have the same number of features.

http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html

I would like to know how to implement get_top_terms_per_cluster():

X = vectorizer.fit_transform(dataset)  # with m features
X = lsa.fit_transform(X)  # reduce number of features to m'
k_means.fit(X)
get_top_terms_per_cluster()  # out of m features
Ofer Helman asked Feb 12 '23

1 Answer

Assuming lsa = TruncatedSVD(n_components=k) for some k, the obvious way to get term weights makes use of the fact that LSA/SVD is a linear transformation, i.e., each row of lsa.components_ is a weighted sum of the input terms, and you can multiply that with the cluster centroids from k-means.
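The shape bookkeeping behind that multiplication can be sketched with toy arrays (the sizes here are made up, not taken from the 20newsgroups run below):

```python
import numpy as np

# Toy sizes: m = 6 original terms, m' = 2 LSA components, 3 clusters.
rng = np.random.default_rng(0)
components = rng.normal(size=(2, 6))   # stands in for lsa.components_, shape (m', m)
centroids = rng.normal(size=(3, 2))    # stands in for km.cluster_centers_, shape (k, m')

# Composing the two linear maps expresses each centroid over the original m terms.
term_weights = centroids @ components
print(term_weights.shape)  # (3, 6): one weight per term, per cluster
```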

Let's set some things up and train some models:

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.cluster import KMeans
>>> from sklearn.decomposition import TruncatedSVD
>>> import numpy as np
>>> data = fetch_20newsgroups()
>>> vectorizer = TfidfVectorizer(min_df=3, max_df=.95, stop_words='english')
>>> lsa = TruncatedSVD(n_components=10)
>>> km = KMeans(n_clusters=3)
>>> X = vectorizer.fit_transform(data.data)
>>> X_lsa = lsa.fit_transform(X)
>>> km.fit(X_lsa)

Now multiply the LSA components and the k-means centroids:

>>> X.shape
(11314, 38865)
>>> lsa.components_.shape
(10, 38865)
>>> km.cluster_centers_.shape
(3, 10)
>>> weights = np.dot(km.cluster_centers_, lsa.components_)
>>> weights.shape
(3, 38865)

Then print; we need absolute values for the weights because of the sign indeterminacy in LSA:

>>> features = vectorizer.get_feature_names()
>>> weights = np.abs(weights)
>>> for i in range(km.n_clusters):
...     top5 = np.argsort(weights[i])[-5:]
...     print(list(zip([features[j] for j in top5], weights[i, top5])))
...     
[(u'escrow', 0.042965734662740895), (u'chip', 0.07227072329320372), (u'encryption', 0.074855609122467345), (u'clipper', 0.075661844826553887), (u'key', 0.095064798549230306)]
[(u'posting', 0.012893125486957332), (u'article', 0.013105911161236845), (u'university', 0.0131617377000081), (u'com', 0.023016036009601809), (u'edu', 0.034532489348082958)]
[(u'don', 0.02087448155525683), (u'com', 0.024327099321009758), (u'people', 0.033365757270264217), (u'edu', 0.036318114826463417), (u'god', 0.042203130080860719)]

Mind you, you really need a stop word filter for this to work. The stop words tend to end up in every single component, and get a high weight in every cluster centroid.

Fred Foo answered Feb 14 '23