I want to cluster documents based on similarity.
I have tried ssdeep (similarity hashing), which is very fast, but I was told that k-means is faster and that FLANN is the fastest of all implementations, as well as more accurate. So I am trying FLANN with its Python bindings, but I can't find any example of how to use it on text (it only supports arrays of numbers).
I am very, very new to this field (k-means, natural language processing). What I need is speed and accuracy.
My questions are:
You need to represent your document as an array of numbers (a.k.a. a vector). There are many ways to do this, depending on how sophisticated you want to be, but the simplest way is just to represent it as a vector of word counts.
So here's what you do:
Count up the number of times each word appears in the document.
Choose a set of "feature" words that will be included in your vector. This should exclude extremely common words (aka "stopwords") like "the", "a", etc.
Make a vector for each document based on the counts of the feature words.
Here's an example.
If your "documents" are single sentences, and they look like (one doc per line):
there is a dog who chased a cat
someone ate pizza for lunch
the dog and a cat walk down the street toward another dog
If my set of feature words is [dog, cat, street, pizza, lunch], then I can convert each document into a vector:
[1, 1, 0, 0, 0] // dog 1 time, cat 1 time
[0, 0, 0, 1, 1] // pizza 1 time, lunch 1 time
[2, 1, 1, 0, 0] // dog 2 times, cat 1 time, street 1 time
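As a minimal sketch, here is how you might build those count vectors in plain Python (the feature words and sentences are the ones from the example above):

```python
# Build count vectors over a fixed list of feature words.
feature_words = ["dog", "cat", "street", "pizza", "lunch"]

documents = [
    "there is a dog who chased a cat",
    "someone ate pizza for lunch",
    "the dog and a cat walk down the street toward another dog",
]

def to_vector(doc, features):
    """Count how many times each feature word appears in the document."""
    tokens = doc.lower().split()
    return [tokens.count(word) for word in features]

vectors = [to_vector(doc, feature_words) for doc in documents]
print(vectors)
# [[1, 1, 0, 0, 0], [0, 0, 0, 1, 1], [2, 1, 1, 0, 0]]
```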
You can use these vectors in your k-means algorithm, and it will hopefully group the first and third sentences together because they are similar, and put the second sentence in a separate cluster since it is very different.
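For example, a rough sketch of the clustering step (using scikit-learn's KMeans rather than FLANN, just to illustrate the idea; the vectors are the ones built above):

```python
import numpy as np
from sklearn.cluster import KMeans

# The count vectors from the example above.
vectors = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [2, 1, 1, 0, 0],
])

# Ask for two clusters: one for the dog/cat sentences, one for the pizza sentence.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)
print(labels)  # e.g. [0, 1, 0] -- the first and third sentences end up together
```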
There is one big problem here:
K-means is designed for Euclidean distance.
The key problem is the mean function. The mean will reduce variance for Euclidean distance, but it might not do so for a different distance function. So in the worst case, k-means will no longer converge, but run in an infinite loop (although most implementations support stopping at a maximum number of iterations).
Furthermore, the mean is not very sensible for sparse data, and text vectors tend to be very sparse. Roughly speaking, the problem is that the mean of a large number of documents will no longer look like a real document; it becomes dissimilar to any real document and more similar to the other mean vectors. So the results degenerate to some extent.
For text vectors, you probably will want to use a different distance function such as cosine similarity.
And of course you first need to compute numeric vectors, for example by using relative term frequencies or by normalizing them via TF-IDF.
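As a sketch of that preprocessing (assuming scikit-learn is available; TfidfVectorizer handles tokenization, stopword removal and TF-IDF weighting, and cosine_similarity compares the resulting vectors):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "there is a dog who chased a cat",
    "someone ate pizza for lunch",
    "the dog and a cat walk down the street toward another dog",
]

# Convert raw text to TF-IDF weighted vectors (sparse matrix, one row per document).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Pairwise cosine similarity between all documents.
print(cosine_similarity(tfidf))
```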
There is a variation of the k-means idea known as k-medoids. It can work with arbitrary distance functions, and it avoids the whole "mean" thing by using the real document that is most central to the cluster (the "medoid"). But the known algorithms for this are much slower than k-means.
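To make the idea concrete, here is a naive k-medoids sketch over a precomputed cosine-distance matrix (an illustration of the principle, not an optimized implementation such as PAM; the function name and parameters are my own):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def k_medoids(X, k, n_iter=100, seed=0):
    """Naive k-medoids: alternate between assigning points to the nearest
    medoid and picking, within each cluster, the point with the smallest
    total distance to the other members."""
    dist = cosine_distances(X)                      # pairwise distance matrix
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # Choose the member minimizing the summed distance to its cluster.
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):    # converged
            break
        medoids = new_medoids
    return labels, medoids
```

You could run it on the TF-IDF matrix from the earlier sketch, e.g. `labels, medoids = k_medoids(tfidf.toarray(), 2)`, and each medoid index points at an actual document rather than an artificial mean vector.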