Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

plot a document tfidf 2D graph

I would like to plot a 2d graph with the x-axis as term and y-axis as TFIDF score (or document id) for my list of sentences. I used scikit learn's fit_transform() to get the scipy matrix but i do not know how to use that matrix to plot the graph. I am trying to get a plot to see how well my sentences can be classified using kmeans.

Here is the output of fit_transform(sentence_list):

(document id, term number) tfidf score

(0, 1023)   0.209291711271
(0, 924)    0.174405532933
(0, 914)    0.174405532933
(0, 821)    0.15579574484
(0, 770)    0.174405532933
(0, 763)    0.159719994016
(0, 689)    0.135518787598

Here is my code:

sentence_list=["Hi how are you", "Good morning" ...]
vectorizer=TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized=vectorizer.fit_transform(sentence_list)
num_samples, num_features=vectorized.shape
print "num_samples:  %d, num_features: %d" %(num_samples,num_features)
num_clusters=10
km=KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit(vectorized)
PRINT km.labels_   # Returns a list of clusters ranging 0 to 10 

Thanks,

like image 921
jxn Avatar asked Jan 26 '15 23:01

jxn


People also ask

How do you use a TF-IDF Vectorizer?

TF-IDF Vectorizer is a measure of originality of a word by comparing the number of times a word appears in document with the number of documents the word appears in. formula for TF-IDF is: TF-IDF = TF(t, d) x IDF(t), where, TF(t, d) = Number of times term "t" appears in a document "d".

What does Tfidf transform do?

Conclusion. TF-IDF (Term Frequency - Inverse Document Frequency) is a handy algorithm that uses the frequency of words to determine how relevant those words are to a given document. It's a relatively simple but intuitive approach to weighting words, allowing it to act as a great jumping off point for a variety of tasks ...


1 Answers

When you use Bag of Words, each of your sentences gets represented in a high dimensional space of length equal to the vocabulary. If you want to represent this in 2D you need to reduce the dimension, for example using PCA with two components:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

newsgroups_train = fetch_20newsgroups(subset='train', 
                                      categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])        
X = pipeline.fit_transform(newsgroups_train.data).todense()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1], c=data.target)
plt.show()              #not required if using ipython notebook

data2d

Now you can for example calculate and plot the cluster enters on this data:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)

plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1], 
            marker='x', s=200, linewidths=3, c='r')
plt.show()              #not required if using ipython notebook

enter image description here

like image 191
elyase Avatar answered Sep 23 '22 23:09

elyase