
How can I cluster documents using k-means (FLANN with Python)?

I want to cluster documents based on similarity.

I have tried ssdeep (similarity hashing), which is very fast, but I was told that k-means is faster, that FLANN is the fastest of all implementations, and that it is more accurate. So I am trying FLANN with its Python bindings, but I can't find any example of how to use it on text (it only supports arrays of numbers).

I am very new to this field (k-means, natural language processing). What I need is speed and accuracy.

My questions are:

  1. Can we do document similarity grouping/clustering using k-means? (FLANN does not seem to allow any text input.)
  2. Is FLANN the right choice? If not, please suggest a high-performance library that supports text/document clustering and has a Python wrapper/API.
  3. Is k-means the right algorithm?
Phyo Arkar Lwin asked Sep 19 '12

2 Answers

You need to represent your document as an array of numbers (aka, a vector). There are many ways to do this, depending on how sophisticated you want to be, but the simplest way is just to represent it as a vector of word counts.

So here's what you do:

  1. Count up the number of times each word appears in the document.

  2. Choose a set of "feature" words that will be included in your vector. This should exclude extremely common words (aka "stopwords") like "the", "a", etc.

  3. Make a vector for each document based on the counts of the feature words.

Here's an example.

If your "documents" are single sentences that look like this (one doc per line):

there is a dog who chased a cat
someone ate pizza for lunch
the dog and a cat walk down the street toward another dog

If my set of feature words is [dog, cat, street, pizza, lunch], then I can convert each document into a vector:

[1, 1, 0, 0, 0]  // dog 1 time, cat 1 time
[0, 0, 0, 1, 1]  // pizza 1 time, lunch 1 time
[2, 1, 1, 0, 0]  // dog 2 times, cat 1 time, street 1 time

You can use these vectors in your k-means algorithm and it will hopefully group the first and third sentence together because they are similar, and make the second sentence a separate cluster since it is very different.
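The three steps above can be sketched in plain Python (the `to_vector` helper and the hard-coded feature list are just for illustration, not from any library):

```python
# Steps 1-3: turn each document into a count vector over the feature words.
docs = [
    "there is a dog who chased a cat",
    "someone ate pizza for lunch",
    "the dog and a cat walk down the street toward another dog",
]
features = ["dog", "cat", "street", "pizza", "lunch"]

def to_vector(doc, features):
    """Count how often each feature word occurs in the document."""
    words = doc.split()
    return [words.count(f) for f in features]

vectors = [to_vector(d, features) for d in docs]
print(vectors)  # [[1, 1, 0, 0, 0], [0, 0, 0, 1, 1], [2, 1, 1, 0, 0]]
```

Since these vectors are plain arrays of numbers, they can then be fed to any k-means implementation, FLANN included.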

dhg answered Oct 18 '22


There is one big problem here:

K-means is designed for Euclidean distance.

The key problem is the mean function. The mean will reduce variance for Euclidean distance, but it might not do so for a different distance function. So in the worst case, k-means will no longer converge, but run in an infinite loop (although most implementations support stopping at a maximum number of iterations).

Furthermore, the mean is not very sensible for sparse data, and text vectors tend to be very sparse. Roughly speaking, the problem is that the mean of a large number of documents no longer looks like a real document; it becomes dissimilar to every real document and more similar to the other mean vectors. So the results degenerate to some extent.

For text vectors, you probably will want to use a different distance function such as cosine similarity.

And of course you first need to compute numeric vectors, for example by using relative term frequencies and normalizing them via TF-IDF.
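A minimal sketch of that pipeline, using only the standard library (the function names here are my own, not from any particular package):

```python
import math

def tfidf_vectors(docs_tokens):
    """Relative term frequency, weighted by inverse document frequency."""
    n = len(docs_tokens)
    df = {}  # document frequency of each term
    for toks in docs_tokens:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    vocab = sorted(df)
    return vocab, [
        [(toks.count(t) / len(toks)) * math.log(n / df[t]) for t in vocab]
        for toks in docs_tokens
    ]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

docs = [s.split() for s in ("dog chased cat", "pizza for lunch", "dog cat street")]
vocab, vecs = tfidf_vectors(docs)
# the two dog/cat documents share weighted terms; the pizza one shares none
print(cosine_similarity(vecs[0], vecs[2]) > cosine_similarity(vecs[0], vecs[1]))  # True
```

Cosine similarity ignores document length and only compares direction, which is exactly what you want for term-weight vectors.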

There is a variation of the k-means idea known as k-medoids. It can work with arbitrary distance functions, and it avoids the whole "mean" thing by using the real document that is most central to the cluster (the "medoid"). But the known algorithms for this are much slower than k-means.
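To make the k-medoids idea concrete, here is a naive sketch that alternates assignment and medoid update (real PAM does more careful swap-based optimization; the helper names and the simple convergence test are mine):

```python
import random

def assign(points, medoids, dist):
    # label each point with the index of its nearest medoid
    return [min(medoids, key=lambda m: dist(p, points[m])) for p in points]

def k_medoids(points, k, dist, iters=20, seed=0):
    """Naive alternating k-medoids; assumes distinct points so no cluster empties."""
    medoids = random.Random(seed).sample(range(len(points)), k)
    for _ in range(iters):
        labels = assign(points, medoids, dist)
        new_medoids = []
        for m in medoids:
            members = [i for i, lbl in enumerate(labels) if lbl == m]
            # the new medoid is the member minimizing total distance to its cluster
            new_medoids.append(min(
                members,
                key=lambda c: sum(dist(points[c], points[o]) for o in members)))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids

# toy 1-D example: two well-separated groups
pts = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
meds = k_medoids(pts, 2, lambda a, b: abs(a - b))
labels = assign(pts, meds, lambda a, b: abs(a - b))
```

The point is that `dist` can be any function, e.g. `1 - cosine_similarity`, because the "center" of each cluster is always an actual data point rather than a computed mean.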

Has QUIT--Anony-Mousse answered Oct 18 '22