I can't manage my RSS feeds easily because of the overwhelming number of new stories, many with similar content, posted across various news sites. For subjects such as world news and business news, many of the stories are redundant, which burdens readers with sorting out which stories they have already read. To deal with the twin problems of flooding and redundancy, I need to develop code that reduces the number of items to read and uses the overlapping information to surface interesting topics.
It would be easier if I could group similar news content together, as Google News and Stack Overflow do, and present the groups to users.
Computer algorithms determine what shows up in Google News. The algorithms determine which stories, images, and videos show, and in what order. In some cases, people like publishers and Google News teams choose stories. Google News shows some content in a personalized way.
This is definitely not an easy problem, but it can be approached as follows:
First of all, I'd assign the different news sources to relatively broad categories. You can usually assume a tech news source won't publish news under an economics category. (Or it will, and that's the problem.)
In most cases the news title isn't touched; it stays in its original form or very close to it. So category, title, and publish date are a good starting point for grouping news items together.
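A minimal sketch of that first pass, assuming each feed item has already been parsed into a dict with hypothetical `category`, `title`, and `published` fields (these names are my own, not part of any RSS library). Grouping on the calendar day is also an assumption; a sliding time window may fit better for stories published around midnight.

```python
import re
from collections import defaultdict
from datetime import datetime


def normalize_title(title):
    """Lowercase the title, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def group_items(items):
    """Group items that share category, normalized title, and publish day.

    `items` is a list of dicts with hypothetical keys:
    "category" (str), "title" (str), "published" (datetime).
    """
    groups = defaultdict(list)
    for item in items:
        key = (
            item["category"],
            normalize_title(item["title"]),
            item["published"].date(),  # assumption: same calendar day
        )
        groups[key].append(item)
    return list(groups.values())
```

This only catches titles that are identical after normalization, so a reworded headline ends up in its own group.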
If you detect problems with the methods above, you need some fine-tuning under the hood.
You may need to read the whole article and compare two (or thousands of) articles word by word.
If the raw texts of two news items are similar (you can define a threshold value), you can then compare the other factors described above again.
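One simple way to put a number on "similar raw texts" is Jaccard similarity over the sets of words in two articles. This is only a sketch of the idea: the choice of Jaccard and the default threshold of 0.6 are my assumptions, and the threshold would need tuning on real feeds.

```python
import re


def word_set(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(text_a, text_b):
    """Jaccard similarity of the two word sets: |A & B| / |A | B|."""
    a, b = word_set(text_a), word_set(text_b)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def is_similar(text_a, text_b, threshold=0.6):
    """True when the word overlap reaches the (assumed) threshold."""
    return jaccard(text_a, text_b) >= threshold
```

Comparing every pair this way is quadratic in the number of articles, so it works best inside the smaller groups produced by category and date.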
Some news sources provide good tagging in their RSS feeds; you can use this too, but don't rely on it.
And remember, you'll need a lot of fine-tuning at the start (for about a year); after that you'll be fine.
I read somewhere, though I don't have a reference, that Google News uses a variant of MinHash to detect near-duplicate news posts. A lot of them are almost identical, coming from a press agency with only minor adaptations by the newspapers.
http://en.wikipedia.org/wiki/MinHash
has a reference and the statement that Google News used a variant of LSH and MinHash:
Das, Abhinandan S. et al. (2007), "Google news personalization: scalable online collaborative filtering", Proceedings of the 16th international conference on World Wide Web. ACM
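To show the basic MinHash idea (not the variant from the paper, and not Google's implementation), here is a small sketch using only the standard library. The word-level 3-shingles, the seeded SHA-1 hash family, and the 128 hash functions are all my own assumptions.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles; short texts yield one shingle."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _seeded_hash(seed, value):
    """One member of the hash family: SHA-1 of "seed:value" as a 64-bit int."""
    digest = hashlib.sha1(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all items."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two rewrites of the same agency story should share most shingles and so agree in most signature positions. For large collections the signatures are usually bucketed with LSH so that only likely duplicates are compared.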