Incremental clustering algorithm for grouping news articles?

Tags:

cluster-analysis

I'm doing a little research on how to cluster articles into 'news stories' à la Google News.

Looking at previous questions here on the subject, I often see it recommended to simply pull out a vector of words from an article, weight some of the words more if they're in certain parts of the article (e.g. the headline), and then to use something like a k-means algorithm to cluster the articles.

But this leads to a couple of questions:

1) With k-means, how do you know in advance what k should be? In a dynamic news environment the number of stories can vary widely, and you won't know in advance how many stories a collection of articles represents.
2) With hierarchical clustering algorithms, how do you decide which clusters to use as your stories? You'll have clusters at the bottom of the tree that are just single articles, which you obviously won't want to use, and a cluster at the root of the tree that contains all of the articles, which again you won't want. But how do you know which clusters in between should be used to represent stories?
3) Finally, with either k-means or hierarchical algorithms, most of the literature I have read seems to assume you have a preset collection of documents and clusters them all at once. But what about a situation where new articles come in every so often? Do you have to re-cluster all the articles from scratch now that there's an additional one? I can't imagine that's very efficient, which is why I'm wondering if there are approaches that let you 'add' articles as you go without re-clustering from scratch.

asked Aug 31 '10 18:08

Peter

2 Answers

I worked at a start-up that built exactly this: an incremental clustering engine for news articles. We based our algorithm on this paper: Web Document Clustering Using Document Index Graph (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=4289851). It worked well for us at about 10K articles per day.

It has two main advantages: 1) It's incremental, which addresses your problem of handling a stream of incoming articles rather than clustering everything at once. 2) It uses phrase-based modeling, as opposed to just "bag of words", which results in much higher accuracy.
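To make the "incremental" part concrete, here is a minimal sketch. It is not the Document Index Graph algorithm from the paper: it swaps in a much simpler single-pass scheme over plain bag-of-words vectors, purely for illustration. Each new article is compared against the existing cluster centroids by cosine similarity and either joins the best match or starts a new story. The names `IncrementalClusterer`, `tokenize`, `cosine` and the `threshold` value of 0.3 are all made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased bag of words. A real system would add stop-word
    removal, term weighting (e.g. headline boost) or phrase features."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Single-pass clustering: articles are added one at a time and
    earlier assignments are never recomputed."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value; needs tuning on real data
        self.centroids = []         # summed term counts, one Counter per cluster
        self.members = []           # article ids, one list per cluster

    def add(self, article_id, text):
        """Assign one article to a cluster and return the cluster index."""
        vec = tokenize(text)
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            # Close enough to an existing story: fold the article into it.
            self.centroids[best].update(vec)
            self.members[best].append(article_id)
            return best
        # Nothing similar enough: this article starts a new story.
        self.centroids.append(Counter(vec))
        self.members.append([article_id])
        return len(self.centroids) - 1
```

Calling `add()` once per incoming article costs one comparison per existing cluster, so nothing is re-clustered from scratch. The trade-off is that the result depends on arrival order and on the threshold.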

A Google search turns up http://www.similetrix.com; they might have what you're looking for.

answered Sep 20 '22 07:09

Octodone

I would search for adaptive k-means clustering algorithms. There is a good body of research devoted to the problems you describe. Here is one such paper (pdf).
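"Adaptive k-means" covers a number of different methods, so the following is only a rough sketch of one simple variant and not the method from any particular paper. It is a sequential k-means in which k is not fixed up front: a point that lies farther than some distance from every centroid opens a new cluster, and otherwise the nearest centroid is moved toward the point with a running-mean update. The class name `AdaptiveKMeans`, the method name `partial_fit` and the parameter `new_cluster_distance` are invented for this example, and the points are assumed to be dense numeric vectors (for instance, already-computed article features).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class AdaptiveKMeans:
    """Sequential k-means where the number of clusters grows on demand."""

    def __init__(self, new_cluster_distance):
        # Assumed tuning knob: how far from every centroid a point
        # must be before it is treated as the start of a new cluster.
        self.new_cluster_distance = new_cluster_distance
        self.centroids = []  # one list of floats per cluster
        self.counts = []     # number of points absorbed by each cluster

    def partial_fit(self, point):
        """Absorb one point and return the index of its cluster."""
        if self.centroids:
            nearest = min(
                range(len(self.centroids)),
                key=lambda i: euclidean(point, self.centroids[i]),
            )
            if euclidean(point, self.centroids[nearest]) <= self.new_cluster_distance:
                # Running-mean update: centroid += (point - centroid) / n
                self.counts[nearest] += 1
                n = self.counts[nearest]
                self.centroids[nearest] = [
                    c + (x - c) / n
                    for c, x in zip(self.centroids[nearest], point)
                ]
                return nearest
        # No centroid is close enough (or none exist yet): open a new cluster.
        self.centroids.append(list(point))
        self.counts.append(1)
        return len(self.centroids) - 1
```

This trades the question "what should k be?" for "how far apart are two different stories?", which is often easier to estimate from a sample of labeled articles.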

answered Sep 18 '22 07:09

Eric LaForce
