How does kmeans know how to cluster documents when we only feed it tfidf vectors of individual words?

I am using scikit-learn's KMeans algorithm to cluster comments.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentence_list = ['hello how are you', "I am doing great", "my name is abc"]

vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)

num_clusters = 2  # num_clusters was not defined in the original snippet
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)

When I print vectorized, it gives me word indices paired with tf-idf scores.

So I'm wondering: given that we only get tf-idf scores for individual words, how do we manage to cluster documents based on individual words rather than on a score for an entire document? Or maybe it does exactly that. Can someone explain the concept behind this?
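For reference, the matrix can be inspected like this (a minimal sketch; get_feature_names_out assumes scikit-learn >= 1.0, older versions use get_feature_names):

from sklearn.feature_extraction.text import TfidfVectorizer

sentence_list = ['hello how are you', "I am doing great", "my name is abc"]
vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english')
vectorized = vectorizer.fit_transform(sentence_list)

# One column per vocabulary word; stop words ('how', 'are', 'you', ...) are absent
print(vectorizer.get_feature_names_out())

# One row per document: the full tf-idf vector of that document
print(vectorized.toarray())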

asked Dec 21 '14 by jxn

2 Answers

You should take a look at how the KMeans algorithm works. First, the stop words never make it into vectorized, so they are completely ignored by KMeans and have no influence on how the documents are clustered. Now suppose you have:

sentence_list=["word1", "word2", "word2 word3"]

Let's say you want 2 clusters. In this case you expect the second and the third documents to end up in the same cluster because they share a common word. Let's see how this happens.

The numeric representation of the docs in vectorized looks like (one row per document, one column per word):

word1     word2     word3
    1  0.000000  0.000000     # doc 1
    0  1.000000  0.000000     # doc 2
    0  0.605349  0.795961     # doc 3
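You can reproduce this matrix yourself; a minimal sketch with scikit-learn's default TfidfVectorizer settings (pandas is used only for pretty-printing):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["word1", "word2", "word2 word3"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Rows are documents, columns are vocabulary words (sorted alphabetically)
print(pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out()))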

In the first step of KMeans, initial centroids are chosen from the data points; suppose, for example, that document 1 and document 3 are picked as the initial centroids:

Centroid 1:     [1, 0.000000, 0.000000]

Centroid 2:     [0, 0.605349, 0.795961]

Now if you calculate the distance from every point (document) to each of the two centroids, you will see that:

  • document 1 has distance 0 to centroid 1 so it belongs to centroid 1
  • document 3 has distance 0 to centroid 2 so it belongs to centroid 2

Finally we calculate the distance between the remaining document 2 and each centroid to find out which one it belongs to:

>>> from scipy.spatial.distance import euclidean

>>> euclidean([0, 1, 0], [1, 0, 0])               # dist(doc2, centroid1)
1.4142135623730951

>>> euclidean([0, 1, 0], [0, 0.605349, 0.795961]) # dist(doc2, centroid2)
0.8884272507056005

So the second document is closer to the second centroid, which means it is assigned to the second centroid and ends up in the same cluster as document 3, as expected.
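Putting it all together, a short sketch of the full pipeline on the toy corpus (random_state is an assumption added here for reproducibility; the specific cluster ids are arbitrary and may vary):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["word1", "word2", "word2 word3"]
X = TfidfVectorizer().fit_transform(docs)

km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)

# Documents 2 and 3 get the same label while document 1 gets its own
# cluster, e.g. [1 0 0]
print(labels)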

answered Oct 15 '22 by elyase


TF/IDF is a measure that calculates the importance of a word in a document relative to the rest of the words in the corpus. It does not compute the importance of a word in isolation (and that makes sense, right? Because importance always means privilege over others!). So the TF/IDF score of each word is really a measure of how characteristic that word is of a given document.
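As a concrete check, here is a small sketch reproducing the doc 3 values from the first answer, using scikit-learn's default smoothed idf (smooth_idf=True) followed by l2 normalization:

import numpy as np

# scikit-learn's default: idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
n_docs = 3
idf_word2 = np.log((1 + n_docs) / (1 + 2)) + 1  # 'word2' appears in 2 docs
idf_word3 = np.log((1 + n_docs) / (1 + 1)) + 1  # 'word3' appears in 1 doc

# doc 3 = "word2 word3": term counts are 1, multiply by idf, then l2-normalize
row = np.array([1 * idf_word2, 1 * idf_word3])
print(row / np.linalg.norm(row))  # -> approximately [0.605349, 0.795961]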

In your code, TfidfVectorizer computes these scores, and KMeans then uses them as features. Note, however, that meaningful clustering of the three sample documents you mentioned is hardly possible, since no two of them share a common word!

Edit 1: First of all, if the word 'cat' occurs in two documents, it is possible that they will be clustered together (depending on the other words in those two documents and on the rest of the corpus). Secondly, you should learn more about how k-means works: it uses the features to cluster documents together, and the tf/idf score of each word in a document is a feature that is used to compare that document against the others in the corpus.
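For instance, a minimal sketch (the toy documents here are made up for illustration) showing two documents that share 'cat' landing in the same cluster:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cat sat", "cat ran", "dog barked"]
X = TfidfVectorizer().fit_transform(docs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The two 'cat' documents share a label, e.g. [0 0 1]
print(labels)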

answered Oct 15 '22 by user823743