I have the following code to cluster some example text with scikit-learn.
train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog", "blue sweater", "red hat", "kitty blue"]
vect = TfidfVectorizer()
X = vect.fit_transform(train)
clf = KMeans(n_clusters=3)
clf.fit(X)
centroids = clf.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=80, linewidths=5)
plt.show()
The thing I can't figure out is how to plot the clustered results. X is a csr_matrix, but what I want is an (x, y) coordinate for each row so I can plot it.
Thanks.
The k-means algorithm captures the insight that each point in a cluster should be near the center of that cluster. It works like this: first we choose k, the number of clusters we want to find in the data. The centers of those k clusters, called centroids, are initialized in some fashion, and the algorithm then alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of the points assigned to it, until the assignments stop changing.
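That assign-then-recompute loop is easy to sketch. Here is a minimal, illustrative numpy version (the name simple_kmeans and the random initialization are just for this sketch; the scikit-learn KMeans used below does the same thing with smarter k-means++ initialization):

import numpy as np

def simple_kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids by picking k distinct points at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids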
Here is a longer, better answer with more data:
import matplotlib.pyplot as plt
from numpy import concatenate
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
train = [
'In 1917 a German Navy flight crashed at/near Off western Denmark with 18 aboard',
# 'There were 18 passenger/crew fatalities',
'In 1942 a Deutsche Lufthansa flight crashed at an unknown location with 4 aboard',
# 'There were 4 passenger/crew fatalities',
'In 1946 Trans Luxury Airlines flight 878 crashed at/near Moline, Illinois with 25 aboard',
# 'There were 2 passenger/crew fatalities',
'In 1947 a Slick Airways flight crashed at/near Hanksville, Utah with 3 aboard',
'There were 3 passenger/crew fatalities',
'In 1949 a Royal Canadian Air Force flight crashed at/near Near Bigstone Lake, Manitoba with 21 aboard',
'There were 21 passenger/crew fatalities',
'In 1952 a Airwork flight crashed at/near Off Trapani, Italy with 57 aboard',
'There were 7 passenger/crew fatalities',
'In 1963 a Aeroflot flight crashed at/near Near Leningrad, Russia with 52 aboard',
'In 1966 a Alaska Coastal Airlines flight crashed at/near Near Juneau, Alaska with 9 aboard',
'There were 9 passenger/crew fatalities',
'In 1986 a Air Taxi flight crashed at/near Frenchglen, Oregon with 6 aboard',
'There were 3 passenger/crew fatalities',
'In 1989 a Air Taxi flight crashed at/near Gold Beach, Oregon with 3 aboard',
'There were 18 passenger/crew fatalities',
'In 1990 a Republic of China Air Force flight crashed at/near Yunlin, Taiwan with 18 aboard',
'There were 10 passenger/crew fatalities',
'In 1992 a Servicios Aereos Santa Ana flight crashed at/near Colorado, Bolivia with 10 aboard',
'There were 44 passenger/crew fatalities',
'In 1994 Royal Air Maroc flight 630 crashed at/near Near Agadir, Morocco with 44 aboard',
'There were 10 passenger/crew fatalities',
'In 1995 Atlantic Southeast Airlines flight 529 crashed at/near Near Carrollton, GA with 29 aboard',
'There were 44 passenger/crew fatalities',
'In 1998 a Lumbini Airways flight crashed at/near Near Ghorepani, Nepal with 18 aboard',
'There were 18 passenger/crew fatalities',
'In 2004 a Venezuelan Air Force flight crashed at/near Near Maracay, Venezuela with 25 aboard',
'There were 25 passenger/crew fatalities',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train)
n_clusters = 2
random_state = 1
clf = KMeans(n_clusters=n_clusters, random_state=random_state)
clf.fit(X)
centroids = clf.cluster_centers_
# t-SNE has no separate transform step, so reduce the document rows and the centroids together
everything = concatenate((X.todense(), centroids))
tsne_init = 'pca' # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 10
model = TSNE(n_components=2, random_state=random_state, init=tsne_init,
             perplexity=tsne_perplexity,
             early_exaggeration=tsne_early_exaggeration,
             learning_rate=tsne_learning_rate)
transformed_everything = model.fit_transform(everything)
print(transformed_everything)
plt.scatter(transformed_everything[:-n_clusters, 0], transformed_everything[:-n_clusters, 1], marker='x')
plt.scatter(transformed_everything[-n_clusters:, 0], transformed_everything[-n_clusters:, 1], marker='o')
plt.show()
There are two clear clusters in the data: one is the descriptions of the crashes, the other the summaries of the fatalities. It is easy to comment lines in and out to tune the cluster sizes up and down a little. As written, the code should show two blue clusters, one larger and one smaller, with two orange centroid markers. There appear to be fewer markers than rows of data because some rows are transformed onto identical points in space.
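If you also want the document points coloured by their assigned cluster rather than all drawn in one colour, the labels_ attribute of the fitted KMeans can be passed straight to the first scatter call. A small optional variation on the plotting lines above:

plt.scatter(transformed_everything[:-n_clusters, 0],
            transformed_everything[:-n_clusters, 1],
            c=clf.labels_, marker='x')   # colour each document by its cluster label
plt.scatter(transformed_everything[-n_clusters:, 0],
            transformed_everything[-n_clusters:, 1],
            marker='o')                  # centroids
plt.show()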
Finally, a smaller t-SNE learning rate seems to produce tighter clusters.