Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Incremental clustering algorithm for grouping news articles?

I'm doing a little research on how to cluster articles into 'news stories' ala Google News.

Looking at previous questions here on the subject, I often see it recommended to simply pull out a vector of words from an article, weight some of the words more if they're in certain parts of the article (e.g. the headline), and then to use something like a k-means algorithm to cluster the articles.

But this leads to a couple of questions:

  • With k-means, how do you know in advance how much k should be? In a dynamic news environment you may have a very variable number of stories, and you won't know in advance how many stories a collection of articles represents.

  • With hierarchal clustering algorithms, how do you decide which clusters to use as your stories? You'll have clusters at the bottom of the tree that are just single articles, which you obviously won't want to use, and a cluster at the root of the tree which has all of the articles, which again you won't want...but how do you know which clusters in between should be used to represent stories?

  • Finally, with either k-means or hierarchal algorithms, most literature I have read seems to assume you have a preset collection of documents you want to cluster, and it clusters them all at once. But what of a situation where you have new articles coming in every so often. What happens? Do you have to cluster all the articles from scratch, now that there's an additional one? This is why I'm wondering if there are approaches that let you 'add' articles as you go without re-clustering from scratch. I can't imagine that's very efficient.

like image 436
Peter Avatar asked Aug 31 '10 18:08

Peter


People also ask

What is incremental clustering algorithm?

The incremental clustering approach is a way to address the dynamically growing dataset. It attempts to minimize the scanning and calculation effort for newly added data points[1]. It is essential to efficiently store and utilize knowledge about the existing clustering results for incremental clustering.

What is clustering process of grouping?

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group than those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.


2 Answers

I worked on a start-up that built exactly this: an incremental clustering engine for news articles. We based our algorithm on this paper: Web Document Clustering Using Document Index Graph (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=4289851). Worked well for us for 10K articles / day.

It has two main advantages: 1) It's incremental, which addresses the problem you have with having to deal with a stream of incoming articles (rather than clustering all at once) 2) It uses phrase-based modeling, as opposed to just "bag of words", which results in much higher accuracy.

A Google search pops up http://www.similetrix.com, they might have what you're looking for.

like image 123
Octodone Avatar answered Sep 20 '22 07:09

Octodone


I would do a search for adaptive K-means clustering algorithms. There is a good section of research devoted to the problems you describe. Here is one such paper (pdf)

like image 32
Eric LaForce Avatar answered Sep 18 '22 07:09

Eric LaForce