
Grouping similar news contents together like in GOOGLE NEWS

I am unable to manage my RSS feeds easily because of the overwhelming number of new stories and similar news items posted on various news sites. For subjects such as world news and business news, many of the stories are redundant, which burdens readers with sorting out which stories they have already read. To deal with the twin problems of flooding and redundancy, I need to develop code that reduces the number of items to read and uses the overlapping information to identify interesting topics.

It would be easier if I could group similar news items together, as Google News and Stack Overflow do, and present the groups to users.

Gourav asked Oct 18 '10 10:10




2 Answers

This is definitely not an easy problem, but it can be solved with:

  • smart text-parsing functions
  • raw hardware power
  • both of them
  • testing, testing, testing
  • fine-tuning at the end

First of all, I'd assign each news source to a relatively broad category. You can usually assume a tech news source won't publish stories in the economics category. (Or it will, and that's the problem.)

In most cases the news title isn't touched; it stays in its original form, or close to it. So Category, Title, and Publish Date are a good starting point for grouping news items together.
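A first pass along these lines might look like the sketch below. It is only an illustration: the item fields (`source`, `category`, `title`, `published`) and the helper names are assumptions, not part of any feed library, and grouping by calendar day will miss copies of a story published on either side of midnight.

```python
from collections import defaultdict
from datetime import datetime
import re


def normalize_title(title):
    """Lowercase the title and drop punctuation and extra whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def group_items(items):
    """Group items that share category, normalized title, and publish day."""
    groups = defaultdict(list)
    for item in items:
        key = (
            item["category"],
            normalize_title(item["title"]),
            item["published"].date(),
        )
        groups[key].append(item)
    return list(groups.values())


# Hypothetical feed items, already parsed into dictionaries.
items = [
    {"source": "site-a", "category": "tech",
     "title": "New Phone Released!", "published": datetime(2010, 10, 18, 9, 0)},
    {"source": "site-b", "category": "tech",
     "title": "New phone released", "published": datetime(2010, 10, 18, 11, 30)},
    {"source": "site-c", "category": "business",
     "title": "Markets rally", "published": datetime(2010, 10, 18, 12, 0)},
]

groups = group_items(items)
for group in groups:
    print([item["source"] for item in group])
```

Exact matching on the normalized title is deliberately strict; it only catches stories whose headline was left essentially untouched.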

If you run into problems with the methods above, you'll need some fine-tuning under the hood.

You may need to read the whole article and compare two (or thousands of) articles word by word.

  • There are a lot of stopwords that can distort the comparison, so you'll need to ignore them.
  • You may want to define synonyms (J Lo = Jennifer Lopez).

If the raw texts of two news items are similar (you can define a threshold value), you can then compare the other factors described above again.
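One way to sketch the text comparison is with a Jaccard similarity over word sets, after dropping stopwords and mapping synonyms to a canonical form. Jaccard is my choice for the illustration, not something the approach requires, and the tiny `STOPWORDS` set, the `SYNONYMS` table, and the 0.6 threshold are placeholder assumptions you would tune on real data.

```python
import re

# Placeholder lists; a real system would use much larger ones.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "for", "with", "at"}
SYNONYMS = {"j lo": "jennifer lopez"}


def tokens(text):
    """Return the set of meaningful words in text."""
    text = text.lower()
    # Replace each known alias with its canonical form before splitting.
    for alias, canonical in SYNONYMS.items():
        text = re.sub(r"\b" + re.escape(alias) + r"\b", canonical, text)
    return {w for w in re.findall(r"[a-z0-9]+", text) if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of the two texts' word sets, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_similar(a, b, threshold=0.6):
    """True if the texts overlap at least as much as the threshold."""
    return jaccard(a, b) >= threshold


print(is_similar("J Lo attends the premiere", "Jennifer Lopez attends premiere"))
print(is_similar("Markets rally", "J Lo attends the premiere"))
```

Comparing every pair of articles this way is quadratic in the number of articles, which is where the raw hardware power (or a cheaper pre-grouping step) comes in.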

Some news sources provide good tags in their RSS feeds; you can use these too, but don't rely on them.

And remember, you'll need a lot of fine-tuning at the start (about a year), and then you'll be fine.

fabrik answered Oct 17 '22 08:10


I read somewhere, though I don't have the reference at hand, that Google News uses a variant of MinHash to detect near-duplicate news posts. A lot of them are almost identical: they come from a press agency, with only minor adaptations by the newspapers.

The Wikipedia article on MinHash (http://en.wikipedia.org/wiki/MinHash) states that Google News used a variant of LSH and MinHash, and gives this reference:

Das, Abhinandan S. et al. (2007), "Google news personalization: scalable online collaborative filtering", Proceedings of the 16th international conference on World Wide Web. ACM
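To give an idea of how MinHash works, here is a minimal textbook-style sketch. It is not Google's implementation, and the 3-word shingles, the 128 hash functions, and the seeded `blake2b` hashing are all assumptions made for the illustration. The fraction of positions where two signatures agree estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=128):
    """For each hash function, keep the smallest hash over all shingles."""
    sh = shingles(text)
    if not sh:
        raise ValueError("text contains no words")
    return [min(_hash(seed, s) for s in sh) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


agency = ("The central bank raised interest rates by a quarter point "
          "on Tuesday citing inflation concerns")
paper = ("The central bank raised interest rates by a quarter point "
         "on Tuesday, citing rising inflation concerns")
other = "The home team won the championship final after extra time last night"

sig_agency = minhash_signature(agency)
print(estimated_similarity(sig_agency, minhash_signature(paper)))
print(estimated_similarity(sig_agency, minhash_signature(other)))
```

The signatures are short and fixed-size, so they can be stored per article and compared cheaply; LSH then buckets signatures so that only likely near-duplicates are compared at all.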

Has QUIT--Anony-Mousse answered Oct 17 '22 09:10