
How can I plot a k-means text clustering result with matplotlib?

I have the following code to cluster some example text with scikit-learn.

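import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog", "blue sweater", "red hat", "kitty blue"]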

vect = TfidfVectorizer()
X = vect.fit_transform(train)
clf = KMeans(n_clusters=3)
clf.fit(X)
centroids = clf.cluster_centers_

plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=80, linewidths=5)
plt.show()

What I can't figure out is how to plot the clustered results. X is a csr_matrix (a SciPy sparse matrix). What I want is an (x, y) coordinate for each result so that I can plot it.

Ty

Anthony De Meulemeester asked Apr 21 '17 11:04




1 Answer

Here is a longer, better answer with more data:

import matplotlib.pyplot as plt
from numpy import concatenate
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

train = [
    'In 1917 a German Navy flight crashed at/near Off western Denmark with 18 aboard',
    # 'There were 18 passenger/crew fatalities',
    'In 1942 a Deutsche Lufthansa flight crashed at an unknown location with 4 aboard',
    # 'There were 4 passenger/crew fatalities',
    'In 1946 Trans Luxury Airlines flight 878 crashed at/near Moline, Illinois with 25 aboard',
    # 'There were 2 passenger/crew fatalities',
    'In 1947 a Slick Airways flight crashed at/near Hanksville, Utah with 3 aboard',
    'There were 3 passenger/crew fatalities',
    'In 1949 a Royal Canadian Air Force flight crashed at/near Near Bigstone Lake, Manitoba with 21 aboard',
    'There were 21 passenger/crew fatalities',
    'In 1952 a Airwork flight crashed at/near Off Trapani, Italy with 57 aboard',
    'There were 7 passenger/crew fatalities',
    'In 1963 a Aeroflot flight crashed at/near Near Leningrad, Russia with 52 aboard',
    'In 1966 a Alaska Coastal Airlines flight crashed at/near Near Juneau, Alaska with 9 aboard',
    'There were 9 passenger/crew fatalities',
    'In 1986 a Air Taxi flight crashed at/near Frenchglen, Oregon with 6 aboard',
    'There were 3 passenger/crew fatalities',
    'In 1989 a Air Taxi flight crashed at/near Gold Beach, Oregon with 3 aboard',
    'There were 18 passenger/crew fatalities',
    'In 1990 a Republic of China Air Force flight crashed at/near Yunlin, Taiwan with 18 aboard',
    'There were 10 passenger/crew fatalities',
    'In 1992 a Servicios Aereos Santa Ana flight crashed at/near Colorado, Bolivia with 10 aboard',
    'There were 44 passenger/crew fatalities',
    'In 1994 Royal Air Maroc flight 630 crashed at/near Near Agadir, Morocco with 44 aboard',
    'There were 10 passenger/crew fatalities',
    'In 1995 Atlantic Southeast Airlines flight 529 crashed at/near Near Carrollton, GA with 29 aboard',
    'There were 44 passenger/crew fatalities',
    'In 1998 a Lumbini Airways flight crashed at/near Near Ghorepani, Nepal with 18 aboard',
    'There were 18 passenger/crew fatalities',
    'In 2004 a Venezuelan Air Force flight crashed at/near Near Maracay, Venezuela with 25 aboard',
    'There were 25 passenger/crew fatalities',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train)
n_clusters = 2
random_state = 1
clf = KMeans(n_clusters=n_clusters, random_state=random_state)
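clf.fit(X)
centroids = clf.cluster_centers_
# we want to transform the rows and the centroids together;
# toarray() gives a plain ndarray (todense() returns np.matrix, which recent scikit-learn rejects)
everything = concatenate((X.toarray(), centroids))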

tsne_init = 'pca'  # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 10
model = TSNE(n_components=2, random_state=random_state, init=tsne_init,
    perplexity=tsne_perplexity,
    early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)

transformed_everything = model.fit_transform(everything)
print(transformed_everything)
plt.scatter(transformed_everything[:-n_clusters, 0], transformed_everything[:-n_clusters, 1], marker='x')
plt.scatter(transformed_everything[-n_clusters:, 0], transformed_everything[-n_clusters:, 1], marker='o')

plt.show()

There are two clear clusters in the data: one is a description of the crash, the other is a summary of the fatalities. Commenting lines of train in or out makes it easy to adjust the cluster sizes a little. As written, the code should show two blue clusters, one larger and one smaller, with two orange centroids. There are more items of data than there are markers because some rows of data are transformed onto identical points in space.
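If it helps to see which cluster each row was assigned to, the points can be coloured by the labels that KMeans stores in labels_. The following is only a sketch under assumptions: the helper names embed_with_labels and plot_by_label, the default perplexity, the explicit n_init=10 and the colour choices are illustrative and not part of the code above.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE


def embed_with_labels(texts, n_clusters=2, random_state=1, perplexity=5.0):
    """Return 2-D coordinates for the rows and the centroids, plus cluster labels.

    Hypothetical helper: perplexity must stay below len(texts) + n_clusters.
    """
    X = TfidfVectorizer().fit_transform(texts)
    km = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10).fit(X)
    # Stack rows and centroids so both go through the same t-SNE embedding.
    everything = np.vstack((X.toarray(), km.cluster_centers_))
    coords = TSNE(n_components=2, random_state=random_state, init="pca",
                  perplexity=perplexity).fit_transform(everything)
    return coords[:-n_clusters], coords[-n_clusters:], km.labels_


def plot_by_label(points, centers, labels):
    """Scatter the rows coloured by cluster label, with centroids in black."""
    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", marker="x")
    ax.scatter(centers[:, 0], centers[:, 1], c="black", marker="o")
    return fig
```

Calling plot_by_label(*embed_with_labels(train)) followed by plt.show() should then draw each cluster in its own colour.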

[Image: two clusters]

Finally, a smaller t-SNE learning rate seems to produce tighter clusters.
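One way to check the learning-rate effect on your own data is to embed the same TF-IDF rows at several learning rates and compare the results. This is a rough sketch, not a definitive test: the helper names embed_at_learning_rates and mean_spread, the candidate rates and the perplexity are assumptions, and mean_spread measures only the overall spread of the embedding, which is a crude proxy for how tight the clusters look.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE


def embed_at_learning_rates(texts, learning_rates=(10.0, 200.0),
                            perplexity=5.0, random_state=1):
    """Return {learning_rate: 2-D t-SNE coordinates} for the same TF-IDF rows."""
    dense = TfidfVectorizer().fit_transform(texts).toarray()
    embeddings = {}
    for lr in learning_rates:
        model = TSNE(n_components=2, init="pca", perplexity=perplexity,
                     learning_rate=lr, random_state=random_state)
        embeddings[lr] = model.fit_transform(dense)
    return embeddings


def mean_spread(coords):
    """Average distance of the embedded points from their overall mean."""
    return float(np.linalg.norm(coords - coords.mean(axis=0), axis=1).mean())
```

Printing mean_spread for each entry, or plotting the embeddings side by side, gives a quick impression of how much the learning rate changes the layout for a given data set.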

Mike DeLong answered Oct 09 '22 09:10