Algorithm for reading the actual content of news articles and ignoring "noise" on the page?

I'm looking for an algorithm (or some other technique) to read the actual content of news articles on websites and ignore anything else on the page. In a nutshell, I'm reading an RSS feed programmatically from Google News. I'm interested in scraping the actual content of the underlying articles. On my first attempt, I took the URLs from the RSS feed, followed them, and scraped the HTML from each page. This very clearly resulted in a lot of "noise": HTML tags, headers, navigation, and so on. Basically, everything unrelated to the actual content of the article.

Now, I understand this is an extremely difficult problem to solve; a complete solution would theoretically involve writing a parser for every website out there. What I'm interested in is an algorithm (I'd even settle for an idea) that maximizes the actual content I see when I download the article and minimizes the amount of noise.

A couple of additional notes:

  • Scraping the HTML is simply the first attempt I tried. I'm not sold that this is the best way to do things.
  • I don't want to write a parser for every website I come across; I need to accept whatever unpredictable mix of sites Google provides through the RSS feed.
  • I know whatever algorithm I end up with is not going to be perfect, but I'm interested in the best possible solution.

Any ideas?

asked Sep 20 '09 by The Matt

3 Answers

As long as you've accepted the fact that whatever you try is going to be very sketchy given your requirements, I'd recommend you look into Bayesian filtering. This technique has proven to be very effective in filtering spam out of email.
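
A minimal sketch of how that might look, assuming you can already split a page into text blocks and have hand-labeled a small set of them as content vs. noise (the training data, the block segmentation, and the scikit-learn dependency below are all my assumptions, not part of this answer):

```python
# Treat boilerplate removal like spam filtering: each text block on a
# page is a "message", and a naive Bayes classifier decides whether it
# looks like article content or page noise.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical hand-labeled training blocks:
# 0 = noise (navigation, ads, footers), 1 = article content.
train_blocks = [
    "Home | News | Sports | Weather | Contact Us",
    "Subscribe to our newsletter for daily updates",
    "Copyright 2009 Example News. All rights reserved.",
    "The senator announced on Tuesday that the bill would be delayed.",
    "Officials confirmed the casualty figures at a press briefing.",
    "The company reported quarterly earnings well above expectations.",
]
train_labels = [0, 0, 0, 1, 1, 1]

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(
    vectorizer.fit_transform(train_blocks), train_labels
)

def extract_content(blocks):
    """Keep only the blocks the classifier scores as article content."""
    predictions = classifier.predict(vectorizer.transform(blocks))
    return [b for b, keep in zip(blocks, predictions) if keep]
```

In practice you would want far more labeled examples, and features beyond raw word counts (link density, block length, position in the page) tend to help a lot.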

answered by Bill the Lizard


When reading news outside of my RSS reader, I often use Readability to filter out everything but the meat of the article. It is JavaScript-based, so the technique would not directly apply to your problem, but the algorithm has a high success rate in my experience and is worth a look. Hope this helps.
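
Readability's real scoring is more elaborate, but the core idea is easy to sketch. Here is a rough approximation in Python (the BeautifulSoup dependency and the exact scoring formula are my assumptions, not what Readability itself does): score each candidate container by how much plain text it holds versus how link-heavy it is, then keep the winner.

```python
# Crude readability-style heuristic: long runs of text with few links
# look like article content; navigation and footers are short and
# dominated by links. Requires beautifulsoup4.
from bs4 import BeautifulSoup

def guess_article(html):
    soup = BeautifulSoup(html, "html.parser")
    best_text, best_score = "", 0.0
    for tag in soup.find_all(["div", "td", "article"]):
        text = tag.get_text(" ", strip=True)
        if not text:
            continue
        link_text = " ".join(
            a.get_text(" ", strip=True) for a in tag.find_all("a")
        )
        # Penalize containers whose text is mostly link text.
        link_density = len(link_text) / len(text)
        score = len(text) * (1.0 - link_density)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```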

answered by Chris Ballance


Take a look at templatemaker (Google Code homepage). The basic idea is that you feed it a few different pages from the same site and it notes which elements are common across the set of pages. From there you can figure out where the dynamic content is.

Try running diff on two pages from the same site to get an idea of how it works. The parts of the page that are different are the places where there is dynamic (interesting) content.
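
To see that intuition without templatemaker itself, here is a rough sketch using Python's standard difflib (my own illustration, not templatemaker's actual algorithm): lines shared between two pages from the same site are treated as template, and lines unique to one page as the dynamic content.

```python
# Diff two pages from the same site and keep the lines unique to the
# first page; those tend to be the dynamic (article) content rather
# than the shared site template.
import difflib

def dynamic_lines(page_a, page_b):
    lines_a = page_a.splitlines()
    lines_b = page_b.splitlines()
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b)
    unique = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op in ("replace", "delete"):  # present in A, missing from B
            unique.extend(lines_a[a1:a2])
    return unique
```

Real pages also differ in ads, timestamps, and the like, so a two-page diff alone is noisy; templatemaker's value is in generalizing the template across more than two sample pages.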

answered by Steven Kryskalla