I'm looking for an algorithm (or some other technique) to read the actual content of news articles on websites and ignore anything else on the page. In a nutshell, I'm reading an RSS feed programatically from Google News. I'm interested in scraping the actual content of the underlying articles. On my first attempt I have the URLs from the RSS feed and I simply follow them and scrape the HTML from that page. This very clearly resulted in a lot of "noise", whether it be HTML tags, headers, navigation, etc. Basically all the information that is unrelated to the actual content of the article.
Now, I understand this is an extremely difficult problem to solve, it would theoretically involve writing a parser for every website out there. What I'm interested in is an algorithm (I'd even settle for an idea) on how to maximize the actual content that I see when I download the article and minimize the amount of noise.
A couple of additional notes:
Any ideas?
As long as you've accepted that fact that whatever you try is going to be very sketchy based on your requirements, I'd recommend you look into Bayesian filtering. This technique has proven to be very effective in filtering spam out of email.
When reading news outside of my RSS reader, I often use Readability to filter out everything but the meat of the article. It is Javascript-based so the technique would not directly apply to your problem, but the algorithm has a high success rate in my experience and is worth a look. Hope this helps.
Take a look at templatemaker (Google code homepage). The basic idea is that you request a few different pages from the same site, then mark down what elements are common across the set of pages. From there you can figure out where the dynamic content is.
Try running diff
on two pages from the same site to get an idea of how it works. The parts of the page that are different are the places where there is dynamic (interesting) content.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With