Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.

How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?

Note: Ideally, the method would work with both well-formed markup and terrible markup, whether somebody uses paragraph tags to make paragraphs or just a series of breaks.

asked Jun 16 '09 by Sampson

3 Answers

Readability does a decent job of exactly this.

It's open source and posted on Google Code.


UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
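
A minimal sketch of using it from Python, assuming the readability-lxml package (a Python port of Readability) and the requests library are installed; the URL is just a placeholder:

```python
# Sketch only: readability-lxml is a Python port of Readability,
# not the original Google Code project itself.
import requests
from readability import Document

html = requests.get("https://example.com/some-article", timeout=10).text

doc = Document(html)
print(doc.short_title())  # best-guess article title
print(doc.summary())      # cleaned-up HTML of the main article body
```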

answered by Colin Pickard

think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.

How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?

I would probably try something like this:

  • open the URL
  • read in all links to the same website from that page
  • follow those links and build a DOM tree for each URL (HTML file)
  • this should help you find redundant content (shared templates and such)
  • compare the DOM trees for all documents on the same site (tree walking)
  • strip all redundant nodes (i.e. repeated navigational markup, ads and such)
  • try to identify similar nodes and strip them if possible
  • find the largest unique text blocks that are not found in the other DOMs on that website (i.e. the unique content)
  • add these as candidates for further processing

This approach seems promising because it would be fairly simple to implement yet still adaptive, even to complex Web 2.0 pages that make heavy use of templates, because it identifies similar HTML nodes across all pages on the same website.

This could probably be improved further by using a simple scoring system to keep track of DOM nodes that were previously identified as containing unique content, so that those nodes are prioritized on other pages.
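
A minimal sketch of the cross-page comparison idea, assuming Python with the requests and beautifulsoup4 packages; the block-level tag list, page limit, and timeouts are arbitrary illustration choices, not part of the original answer:

```python
# Sketch: text blocks that repeat across sibling pages of the same site
# are treated as template/navigation boilerplate; unique blocks are
# candidates for the main content.
from collections import Counter
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def block_texts(html):
    """Return normalized text for block-level elements on a page."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    blocks = []
    for el in soup.find_all(["p", "div", "td", "li", "article", "section"]):
        text = " ".join(el.get_text(" ", strip=True).split())
        if text:
            blocks.append(text)
    return blocks


def extract_main_text(target_url, max_pages=5):
    """Keep the largest text block on target_url that does not repeat
    on other pages of the same site."""
    html = requests.get(target_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    host = urlparse(target_url).netloc

    # Collect a handful of same-site links to compare against.
    sibling_urls = []
    for a in soup.find_all("a", href=True):
        url = urljoin(target_url, a["href"])
        if urlparse(url).netloc == host and url != target_url:
            sibling_urls.append(url)
        if len(sibling_urls) >= max_pages:
            break

    # Count how often each text block appears across the sibling pages.
    seen_elsewhere = Counter()
    for url in sibling_urls:
        try:
            seen_elsewhere.update(set(block_texts(requests.get(url, timeout=10).text)))
        except requests.RequestException:
            continue

    # Blocks unique to this page are content candidates; pick the longest.
    candidates = [t for t in block_texts(html) if seen_elsewhere[t] == 0]
    return max(candidates, key=len, default="")
```

A scoring version, as suggested above, could persist the XPath or tag path of nodes that produced unique text and boost those paths when processing further pages from the same site.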

answered by none


Sometimes there's a CSS media type defined for 'print'. Its intended use is for 'Click here to print this page' links, and people usually use it to strip away a lot of the fluff and leave only the meat of the information.

http://www.w3.org/TR/CSS2/media.html

I would try to apply that print stylesheet and then scrape whatever is left visible.
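
A rough sketch of that idea, again assuming requests and beautifulsoup4; the @media print parsing here is deliberately naive (regex-based), and a fuller version would also handle stylesheets linked with media="print", whose rules all apply when printing:

```python
# Sketch: find selectors hidden by @media print rules and drop the
# matching elements, then read whatever text remains.
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def hidden_in_print(css_text):
    """Yield selectors that an @media print block hides with display:none."""
    media_blocks = re.findall(
        r"@media[^{]*\bprint\b[^{]*\{((?:[^{}]*\{[^}]*\})*[^{}]*)\}",
        css_text, flags=re.S)
    for block in media_blocks:
        for selectors, body in re.findall(r"([^{}]+)\{([^}]*)\}", block):
            if "display" in body and "none" in body:
                for sel in selectors.split(","):
                    yield sel.strip()


def visible_when_printed(url):
    """Remove elements the site's print CSS hides, return the remaining text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Gather inline <style> blocks and linked stylesheets.
    css_chunks = [s.get_text() for s in soup.find_all("style")]
    for link in soup.find_all("link", rel="stylesheet", href=True):
        try:
            css_chunks.append(
                requests.get(urljoin(url, link["href"]), timeout=10).text)
        except requests.RequestException:
            continue

    # Drop every element a print rule hides.
    for css in css_chunks:
        for selector in hidden_in_print(css):
            try:
                for el in soup.select(selector):
                    el.decompose()
            except Exception:  # soupsieve may reject exotic selectors
                continue

    return soup.get_text(" ", strip=True)
```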

answered by Ian Jacobs