How do I visualize data points of tf-idf vectors for k-means clustering?

I have a list of documents and the tf-idf score for each unique word in the entire corpus. How do I visualize that on a 2-D plot to gauge how many clusters I will need when running k-means?

Here is my code:

from sklearn.feature_extraction.text import TfidfVectorizer

sentence_list = ["Hi how are you", "Good morning"]  # ... more sentences
vectorizer = TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
num_samples, num_features = vectorized.shape
print("num_samples:  %d, num_features: %d" % (num_samples, num_features))
num_clusters = 10

As you can see, I am able to transform my sentences into a tf-idf document-term matrix, but I am unsure how to plot the tf-idf scores as data points.

I was thinking:

1. Add more variables, like document length and something else
2. Do PCA to get an output of 2 dimensions

Thanks

asked Dec 15 '14 22:12

jxn

2 Answers

I am doing something similar at the moment: trying to plot tf-idf scores for a dataset of texts in 2D. My approach, similar to suggestions in other comments, is to use PCA and t-SNE from scikit-learn.

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

num_clusters = 10
num_seeds = 10
max_iterations = 300
labels_color_map = {
    0: '#20b2aa', 1: '#ff7373', 2: '#ffe4e1', 3: '#005073', 4: '#4d0404',
    5: '#ccc0ba', 6: '#4700f9', 7: '#f6f900', 8: '#00f91d', 9: '#da8c49'
}
pca_num_components = 2
tsne_num_components = 2

# texts_list = some array of strings for which TF-IDF is being computed

# calculate tf-idf of texts
tf_idf_vectorizer = TfidfVectorizer(analyzer="word", use_idf=True, smooth_idf=True, ngram_range=(2, 3))
tf_idf_matrix = tf_idf_vectorizer.fit_transform(texts_list)

# create k-means model with custom config
# (precompute_distances and n_jobs were removed from KMeans in scikit-learn 1.0)
clustering_model = KMeans(
    n_clusters=num_clusters,
    max_iter=max_iterations,
    n_init=num_seeds
)

labels = clustering_model.fit_predict(tf_idf_matrix)
# print labels

X = tf_idf_matrix.toarray()  # dense ndarray; recent scikit-learn versions reject the np.matrix returned by todense()

# PCA plot

reduced_data = PCA(n_components=pca_num_components).fit_transform(X)
# print reduced_data

# color each point by its k-means cluster label
point_colors = [labels_color_map[label] for label in labels]
fig, ax = plt.subplots()
ax.scatter(reduced_data[:, 0], reduced_data[:, 1], c=point_colors)
plt.show()



# t-SNE plot, colored by k-means cluster label (cmap has no effect unless c is given)
embeddings = TSNE(n_components=tsne_num_components)
Y = embeddings.fit_transform(X)
plt.scatter(Y[:, 0], Y[:, 1], c=labels, cmap=plt.cm.Spectral)
plt.show()
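
For larger corpora, converting the TF-IDF matrix to a dense array can use a lot of memory. A possible alternative, not part of the code above, is TruncatedSVD, which works on the sparse matrix directly. The following is only a minimal sketch under that assumption; the toy corpus and the random_state value are made up for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus, only for illustration
texts_list = [
    "the cat sat on the mat",
    "a cat and a kitten played",
    "dogs chase the cat",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
    "the central bank raised interest rates",
]

# sparse TF-IDF matrix of shape (n_documents, n_terms)
tf_idf_matrix = TfidfVectorizer().fit_transform(texts_list)

# TruncatedSVD accepts the sparse matrix as-is, so no toarray() call is needed
svd = TruncatedSVD(n_components=2, random_state=0)
reduced_data = svd.fit_transform(tf_idf_matrix)

# one (x, y) pair per document
print(reduced_data.shape)
```

The two columns of reduced_data can then be passed to a scatter plot in the same way as the PCA output.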

answered Oct 13 '22 07:10

gorjanz

PCA is one approach. For TF-IDF I have also used scikit-learn's manifold package for nonlinear dimensionality reduction. One thing that I find helpful is to label my points based on their TF-IDF scores.

Here's an example (you need to insert your own TF-IDF vectorization at the beginning):

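import numpy
import scipy.stats
from matplotlib import pyplot
from sklearn import manifold

# Insert your TF-IDF vectorizing here; the code below assumes `vectorizer` (the fitted
# TfidfVectorizer) and `vectorized` (its sparse output), named as in the question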

##
# Do the dimension reduction
##
k = 10 # number of nearest neighbors to consider
d = 2 # dimensionality
pos = manifold.Isomap(n_neighbors=k, n_components=d, eigen_solver='auto').fit_transform(vectorized.toarray())

##
# Get meaningful "cluster" labels
##
# Semantic labeling of points: apply a label if the row's max TF-IDF is in the 99% quantile of the whole corpus of TF-IDF scores
labels = vectorizer.get_feature_names_out()  # text labels of features (get_feature_names() before scikit-learn 1.0)
t99 = scipy.stats.mstats.mquantiles(vectorized.data, [0.99])[0]
clusterLabels = []
for i in range(0,vectorized.shape[0]):
    row = vectorized.getrow(i)
    if row.max() >= t99:
        arrayIndex = numpy.where(row.data == row.max())[0][0]
        clusterLabels.append(labels[row.indices[arrayIndex]])
    else:
        clusterLabels.append('')
##
# Plot the dimension reduced data
##
pyplot.xlabel('reduced dimension-1')
pyplot.ylabel('reduced dimension-2')
for i in range(len(pos)):  # start at 0 so the first document is plotted too
    pyplot.scatter(pos[i][0], pos[i][1], c='cyan')
    pyplot.annotate(clusterLabels[i], pos[i], xytext=None, xycoords='data', textcoords='data', arrowprops=None)

pyplot.show()

answered Oct 13 '22 07:10

andrew

Tags: python, scipy, k-means, scikit-learn, tf-idf
