 

What is a convenient way to do document clustering with elasticsearch?

I have stored a lot of news articles from the RSS feeds of different sources in an elasticsearch index. At the moment, a search query returns a lot of similar news articles, because the same news topics get covered by many RSS sources.

Instead, I would like to return only one news article out of each group of articles on the same topic. So I somehow need to recognize which articles are about the same topic, cluster these documents, and return only the "best" article from each cluster.

What would be the most convenient way to approach that problem? Can I somehow make use of the elasticsearch more-like-this API? Or is the https://github.com/carrot2/elasticsearch-carrot2 plugin the way to go? Or is there simply no convenient way, so that I have to implement my own version of http://en.wikipedia.org/wiki/K-means_clustering or http://en.wikipedia.org/wiki/Non-negative_matrix_factorization to cluster my documents?

asmaier asked Feb 06 '15 17:02



3 Answers

  1. ES is not particularly useful for clustering. Most clustering algorithms require pairwise distance computations, which are easiest if you can fit all your data into one huge matrix (and then factor it). So it may well be easier (and faster) to work outside ES!

  2. None of the approaches work half as well as advertised. See e.g. "Reading tea leaves". Everybody who constructs such an algorithm is happy to get anything out, and will tune and fiddle with parameters and rerun until the result looks nice. The technical term is cherry picking. Evaluation is incredibly sloppy, and if you look at the results closely, they aren't any better than choosing a random keyword (say, car) and doing a text search on that. In fact, such a search is much more meaningful than the "topics" discovered by topic models, which nobody can decipher in practice. So good luck...

Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems (pp. 288-296).
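To make point 1 concrete, here is a minimal sketch of what "pairwise distance computations" outside ES could look like for a small batch of articles. It is not taken from this answer: the TF-IDF weighting, the whitespace tokeniser, the single-link grouping and the 0.5 threshold are all assumptions chosen for illustration, and the function names are hypothetical. The full similarity matrix is O(n²) in memory, which is exactly why this only works when the data fits in one place.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn whitespace-tokenised documents into L2-normalised TF-IDF dicts."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption; other weightings work as well).
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def similarity_matrix(vectors):
    """Dense n x n matrix of pairwise cosine similarities."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = sim[j][i] = cosine(vectors[i], vectors[j])
    return sim


def group_by_threshold(sim, threshold=0.5):
    """Single-link grouping: any pair at or above the threshold is merged."""
    parent = list(range(len(sim)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(sim)):
        for j in range(i + 1, len(sim)):
            if sim[i][j] >= threshold:
                parent[find(j)] = find(i)
    # Documents with the same label belong to the same group.
    return [find(i) for i in range(len(sim))]


articles = [
    "central bank raises interest rates again",
    "interest rates raised again by central bank",
    "local team wins football cup final",
]
labels = group_by_threshold(similarity_matrix(tfidf_vectors(articles)))
print(labels)
```

The first two articles end up with the same label and the third with a different one. As point 2 warns, the threshold is a tuning knob, so results on real feeds should be inspected rather than trusted.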

Has QUIT--Anony-Mousse answered Oct 23 '22 09:10



I don't think you'll be able to do the clustering adequately from within Elasticsearch. But you can definitely use the clustering results in your ES query.

If I were going to do it, I would use the data you have as input for a clustering algorithm, probably implemented in Apache Spark. I've written a few blog posts about using ES and Spark together (here's one: http://blog.qbox.io/deploy-elasticsearch-and-apache-spark-to-the-cloud). Exactly how to do that is probably outside the scope of a Stack Overflow answer, but there are lots of ways to go about it. You certainly don't have to use Spark, of course (I just like it). Pick your favorite programming paradigm to implement the clustering, or even use a third-party library. There are plenty out there.

Once I was happy with my clustering results, I would save the cluster metadata back to ES as a "parent" dataset, so that every article has a parent document representing the cluster to which it belongs. This relationship could then be used (maybe with a top_children query, a has_parent query, or something similar) to return the results you want.
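As a rough illustration of the last step, the sketch below collapses a list of search hits to the best-scoring article per cluster on the client side. Note that it swaps the parent/child relationship described above for a simpler assumption: each article document carries a plain `cluster_id` field written back after clustering. That field name and the `best_per_cluster` helper are hypothetical; only the `_id`, `_score` and `_source` keys follow the usual shape of ES search hits.

```python
def best_per_cluster(hits, cluster_field="cluster_id"):
    """Keep only the highest-scoring hit from each cluster.

    `hits` is a list of dicts shaped like ES search hits.
    `cluster_field` is a hypothetical field stored on each article.
    """
    best = {}
    for hit in hits:
        cluster = hit["_source"].get(cluster_field)
        # Articles without a cluster are treated as singleton clusters.
        key = ("cluster", cluster) if cluster is not None else ("doc", hit["_id"])
        if key not in best or hit["_score"] > best[key]["_score"]:
            best[key] = hit
    # Return the survivors in relevance order.
    return sorted(best.values(), key=lambda h: h["_score"], reverse=True)


hits = [
    {"_id": "a", "_score": 2.0, "_source": {"cluster_id": "c1"}},
    {"_id": "b", "_score": 3.5, "_source": {"cluster_id": "c1"}},
    {"_id": "c", "_score": 1.0, "_source": {}},
    {"_id": "d", "_score": 2.5, "_source": {"cluster_id": "c2"}},
]
print([h["_id"] for h in best_per_cluster(hits)])  # ['b', 'd', 'c']
```

Here "best" simply means the highest relevance score; any other criterion (freshness, source ranking) could replace the comparison. Doing the collapse client-side means fetching more hits than are shown, so a server-side relationship as described above may be preferable for large result sets.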

Sloan Ahrens answered Oct 23 '22 09:10



Carrot (as mentioned in the question) is very good for clustering the results of a query. It only scales up to hundreds or thousands of documents, but that may be enough. If you need larger scales, then methods like locality-sensitive hashing avoid the need to calculate all the pairwise distances. Using ES's "more-like-this" could work as a quick-and-dirty alternative to hashing, but would probably need some post-processing.
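For readers unfamiliar with locality-sensitive hashing, here is a minimal MinHash-with-banding sketch of the idea: documents are only compared if they land in the same bucket, so most of the pairwise work is skipped. The word-shingle size, the 64 hash functions, the 16 bands and the SHA-1-based hash family are all assumptions picked for illustration, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Set of overlapping k-word shingles (at least one element)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, item):
    """Deterministic 64-bit hash; the seed selects the hash function."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=64):
    """One minimum per hash function; similar sets share many minima."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """Pairs of documents whose signatures agree on at least one full band."""
    rows = len(signatures[0]) // bands
    buckets = defaultdict(set)
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


articles = [
    "the central bank raised interest rates again on thursday",
    "the central bank raised interest rates again on thursday",
    "local football team wins the cup final after extra time",
]
signatures = [minhash_signature(shingles(a)) for a in articles]
print(candidate_pairs(signatures))
```

Only the candidate pairs need an exact similarity check afterwards, which is the post-processing step. More bands with fewer rows each make the filter more permissive; fewer bands make it stricter.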

dcorney answered Oct 23 '22 09:10
