Is there a way to determine the top features/terms for each cluster when the data has been decomposed?
In the example from the sklearn documentation, the top terms are extracted by sorting the features and comparing them with the vectorizer's feature_names, both with the same number of features.
http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html
I would like to know how to implement get_top_terms_per_cluster():
X = vectorizer.fit_transform(dataset) # with m features
X = lsa.fit_transform(X) # reduce number of features to m'
k_means.fit(X)
get_top_terms_per_cluster() # out of m features
Assuming lsa = TruncatedSVD(n_components=k) for some k, the obvious way to get term weights makes use of the fact that LSA/SVD is a linear transformation, i.e., each row of lsa.components_ is a weighted sum of the input terms, and you can multiply that with the cluster centroids from k-means.
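To make the linear-map point concrete, here is a tiny self-contained toy sketch (random numbers, nothing to do with the 20newsgroups data) showing that a point in the reduced space maps back to one weight per original term via lsa.components_:

import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
docs = rng.rand(6, 8)                             # 6 toy "documents", 8 toy "terms"
svd = TruncatedSVD(n_components=3).fit(docs)
centroid = svd.transform(docs).mean(axis=0)       # some point in the 3-d reduced space
term_weights = np.dot(centroid, svd.components_)  # back to 8 per-term weights
print(term_weights.shape)                         # (8,)

The same product, applied to the k-means cluster centers, is what the session below computes.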
Let's set some things up and train some models:
>>> import numpy as np
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.cluster import KMeans
>>> from sklearn.decomposition import TruncatedSVD
>>> data = fetch_20newsgroups()
>>> vectorizer = TfidfVectorizer(min_df=3, max_df=.95, stop_words='english')
>>> lsa = TruncatedSVD(n_components=10)
>>> km = KMeans(n_clusters=3)
>>> X = vectorizer.fit_transform(data.data)
>>> X_lsa = lsa.fit_transform(X)
>>> km.fit(X_lsa)
Now multiply the LSA components and the k-means centroids:
>>> X.shape
(11314, 38865)
>>> lsa.components_.shape
(10, 38865)
>>> km.cluster_centers_.shape
(3, 10)
>>> weights = np.dot(km.cluster_centers_, lsa.components_)
>>> weights.shape
(3, 38865)
Then print the top terms for each cluster; we need the absolute values of the weights because of the sign indeterminacy in LSA:
>>> features = vectorizer.get_feature_names()
>>> weights = np.abs(weights)
>>> for i in range(km.n_clusters):
... top5 = np.argsort(weights[i])[-5:]
... print(zip([features[j] for j in top5], weights[i, top5]))
...
[(u'escrow', 0.042965734662740895), (u'chip', 0.07227072329320372), (u'encryption', 0.074855609122467345), (u'clipper', 0.075661844826553887), (u'key', 0.095064798549230306)]
[(u'posting', 0.012893125486957332), (u'article', 0.013105911161236845), (u'university', 0.0131617377000081), (u'com', 0.023016036009601809), (u'edu', 0.034532489348082958)]
[(u'don', 0.02087448155525683), (u'com', 0.024327099321009758), (u'people', 0.033365757270264217), (u'edu', 0.036318114826463417), (u'god', 0.042203130080860719)]
Mind you, you really need a stop word filter for this to work. The stop words tend to end up in every single component, and get a high weight in every cluster centroid.
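Wrapped up as the get_top_terms_per_cluster() the question asks for, it might look roughly like this (a sketch built from the session above; the n_terms parameter is just illustrative):

def get_top_terms_per_cluster(km, lsa, vectorizer, n_terms=5):
    # Map the centroids from LSA space back to term space; take absolute
    # values because of the sign indeterminacy in LSA.
    weights = np.abs(np.dot(km.cluster_centers_, lsa.components_))
    features = vectorizer.get_feature_names()
    return [[(features[j], weights[i, j])
             for j in np.argsort(weights[i])[-n_terms:][::-1]]
            for i in range(km.n_clusters)]

Calling get_top_terms_per_cluster(km, lsa, vectorizer) on the objects fitted above reproduces the lists printed earlier, with each cluster's terms in descending order of weight.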