What is a convenient way to do document clustering with elasticsearch?

Question

I have stored a large number of news articles from the RSS feeds of different sources in an elasticsearch index. At the moment, a search query returns many similar news articles, because the same news topic gets covered by many RSS sources.

Instead, I would like to return only one news article from each group of articles on the same topic. So I need to recognize which articles are about the same topic, cluster these documents, and return only the "best" article from each cluster.

What would be the most convenient way to approach that problem? Can I somehow make use of the elasticsearch more-like-this API? Or is the https://github.com/carrot2/elasticsearch-carrot2 plugin the way to go? Or is there simply no convenient way, so that I have to implement my own version of http://en.wikipedia.org/wiki/K-means_clustering or http://en.wikipedia.org/wiki/Non-negative_matrix_factorization to cluster my documents?

Has QUIT--Anony-Mousse · Accepted Answer

ES is not particularly useful for clustering. Most clustering algorithms require pairwise distance computations, which is easiest if you can fit all your data into one huge matrix (and then factor it). So it may well be easier (and faster) to work outside ES!
None of the approaches works half as well as advertised. See e.g. “Reading tea leaves” (Chang et al., 2009). Everybody who constructs such an algorithm is happy to get anything out, and will tune and fiddle with parameters and rerun until the result looks nice. The technical term is cherry picking. Evaluation is incredibly sloppy, and if you look at the results closely, they aren't any better than choosing a random keyword (say, car) and doing a text search on that. Such a search is often much more meaningful than the “topics” discovered by topic models, which nobody can decipher in practice. So good luck...

Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems (pp. 288-296).
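
To make the "pairwise distance computations" point concrete, here is a minimal sketch of what such a matrix looks like when it is built outside ES. It is only an illustration, not the method of any particular library: the tokenizer, the smoothed TF-IDF weighting, and the function names (tokenize, tfidf_vectors, cosine, similarity_matrix) are all assumptions made for this example. It uses cosine similarity rather than a distance, and the work grows quadratically with the number of documents, which is why the whole collection needs to fit in memory.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch, not a fixed standard).
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Symmetric n x n matrix of pairwise cosine similarities."""
    vecs = tfidf_vectors(docs)
    n = len(vecs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0  # each document is treated as identical to itself
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = cosine(vecs[i], vecs[j])
    return sim
```

Two headlines about the same story share terms and get a similarity above zero, while headlines with no terms in common get exactly 0.0. A clustering algorithm would then operate on this matrix.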

Sloan Ahrens · Answer

I don't think you'll be able to do the clustering adequately from within Elasticsearch. But you can definitely use the clustering results in your ES query.

If I were going to do it, I would use the data you have as input for a clustering algorithm, probably implemented in Apache Spark. I've written a few blog posts about using ES and Spark together (here's one: http://blog.qbox.io/deploy-elasticsearch-and-apache-spark-to-the-cloud). Exactly how to do that is probably outside the scope of a StackOverflow answer, but there are lots of ways to go about it. You certainly don't have to use Spark, of course (I just like it). Pick your favorite programming paradigm to implement clustering, or even use a third-party library. There are plenty out there.
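
As a rough illustration of "pick your favorite programming paradigm to implement clustering", the following sketch groups documents with a very simple technique: threshold-based single-link clustering, implemented with union-find. This is a deliberately small stand-in chosen for the example, not Spark and not k-means. It assumes you already have a symmetric similarity matrix computed outside ES; the function name cluster_by_threshold and the default threshold of 0.5 are hypothetical.

```python
def cluster_by_threshold(sim, threshold=0.5):
    """Single-link clustering: items i and j land in the same cluster if
    they are connected by a chain of pairs with similarity >= threshold.

    `sim` is a symmetric n x n matrix (list of lists). Returns one cluster
    label per item; labels are numbered 0, 1, 2, ... in order of first
    appearance.
    """
    n = len(sim)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i  # merge the two groups

    labels = {}
    result = []
    for i in range(n):
        root = find(i)
        # Assign the next free label the first time a root is seen.
        result.append(labels.setdefault(root, len(labels)))
    return result
```

Single-link clustering is prone to chaining (A is similar to B, B to C, so all three are merged even if A and C are unrelated), so the threshold would need tuning against real articles.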

Once I was happy with my clustering results, I would save the cluster metadata back to ES as a "parent" dataset, so that every article has a parent document representing the cluster to which it belongs. This relationship could then be used (maybe with a top children query, a has parent query, or something similar) to return the results you want.
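
For comparison, here is a sketch of a simpler client-side variant of the same idea. It does not use the parent/child relationship described above. Instead it assumes that each article was re-indexed with a hypothetical cluster_id field holding its cluster assignment, and it filters the hits of an ordinary search response so that only the highest-scoring article per cluster remains. The hit layout (_id, _score, _source) follows the usual shape of ES search hits; the function name best_per_cluster is made up for this example.

```python
def best_per_cluster(hits, cluster_field="cluster_id"):
    """Keep only the highest-scoring hit of each cluster.

    `hits` is a list of dicts shaped like ES search hits, i.e. with
    "_id", "_score" and "_source" keys. `cluster_field` is a hypothetical
    field in "_source" that stores the cluster assignment. The result is
    sorted by descending score.
    """
    best = {}
    for hit in hits:
        cluster = hit["_source"].get(cluster_field)
        # Hits without a cluster assignment are kept as singletons,
        # keyed by their own document id.
        key = ("cluster", cluster) if cluster is not None else ("doc", hit["_id"])
        if key not in best or hit["_score"] > best[key]["_score"]:
            best[key] = hit
    return sorted(best.values(), key=lambda h: h["_score"], reverse=True)
```

Here "best" simply means the highest relevance score; another criterion, such as the most recent article or a preferred source, could be swapped in at the comparison.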

dcorney · Answer

Carrot2 (as mentioned in the question) is very good for clustering the results of a query. It only scales up to hundreds or thousands of documents, but that may be enough. If you need larger scales, then methods like locality-sensitive hashing avoid the need to calculate all the pairwise distances. Using ES's "more-like-this" could work as a quick-and-dirty alternative to hashing, but would probably need some post-processing.
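
To illustrate how locality-sensitive hashing sidesteps the all-pairs comparison, here is a minimal MinHash sketch with banding. It is one common LSH scheme for near-duplicate text, picked for this example rather than taken from the answer. The three-word shingles, the 32 hash functions, the 8 bands, and all function names are assumptions. Documents are only compared if they fall into the same bucket for at least one band, so most dissimilar pairs are never examined.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, item):
    """Deterministic 64-bit hash of `item`, salted with `seed`."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def minhash_signature(items, num_hashes=32):
    """For each of `num_hashes` seeds, the minimum hash over all items."""
    return tuple(min(_hash(seed, it) for it in items) for seed in range(num_hashes))


def candidate_pairs(docs, num_hashes=32, bands=8):
    """Return index pairs (i, j), i < j, whose signatures agree on a band."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for idx, text in enumerate(docs):
        sh = shingles(text)
        if not sh:
            continue  # nothing to hash for empty documents
        sig = minhash_signature(sh, num_hashes)
        for b in range(bands):
            # Documents sharing all rows of a band land in the same bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs
```

The candidate pairs are only a shortlist. As with more-like-this, some post-processing is still needed, for example computing an exact similarity for each candidate pair and merging the pairs into clusters.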

Tags: algorithm, elasticsearch, cluster-analysis

asmaier
