How to get the text of cluster centers from scikit-learn KMeans?

Question

I have a list of strings that I use to fit sklearn.cluster.KMeans:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
X = TfidfVectorizer().fit_transform(docs)
km = KMeans().fit(X)

Now I would like to get the cluster centers in their original string representation. I know about km.cluster_centers_, but I could not figure out how to get the relevant indices of docs.

Fred Foo · Accepted Answer

There is no "original representation" of the cluster centers in k-means; they are not actually points (vectorized documents) from the input set, but means of multiple points. Such means cannot be transformed back into documents since the bag-of-words representation destroys the order of terms.

One possible approximation is to take a centroid vector, then use TfidfVectorizer.inverse_transform on it to find out which terms have non-zero tf-idf value in it.

You could achieve what you want with the k-medoids algorithm, which does assign actual input points as centroids, but that is not implemented in scikit-learn.
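
If indices into docs are all that is needed, one rough stand-in (this is not k-medoids, and the clustering itself is still plain k-means) is to pick, for each centroid, the input document that lies closest to it. The sketch below assumes the same made-up corpus and an arbitrary n_clusters=2; it relies on KMeans.transform, which returns the distance from every sample to every cluster center.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration), standing in for `docs`.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten play",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# distances[i, j] is the distance from document i to centroid j.
distances = km.transform(X)

# For each centroid (column), the index of the nearest document (row).
closest_doc_idx = distances.argmin(axis=0)
representatives = [docs[i] for i in closest_doc_idx]

for cluster_id, doc in enumerate(representatives):
    print(cluster_id, doc)
```

The selected documents are real members of docs, but they are only the nearest neighbours of the means, so they need not coincide with the medoids a k-medoids run would produce.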

Tags:

python

machine-learning

k-means

scikit-learn

Mathias Loesch

1 Answer

Fred Foo
