Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.

How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?

Note: Ideally, the method would work with both well-formed markup and terrible markup, whether somebody uses paragraph tags to make paragraphs or just a series of breaks.

asked Jun 16 '09 by Sampson

3 Answers

Readability does a decent job of exactly this.

It's open source and posted on Google Code.


UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
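
A minimal sketch of using it from Python, assuming the readability-lxml package (a Python port of Readability) and the requests library are installed; the URL is just a placeholder:

```python
# Sketch only: readability-lxml is a Python port of Readability,
# not the original Google Code project itself.
import requests
from readability import Document

html = requests.get("https://example.com/some-article", timeout=10).text

doc = Document(html)
print(doc.short_title())  # best-guess article title
print(doc.summary())      # cleaned-up HTML of the main article body
```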

answered by Colin Pickard

think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.

How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?

I would probably try something like this:

  • open the URL
  • read in all links to the same website from that page
  • follow those links and build a DOM tree for each URL (HTML file)
  • this should help you find redundant content (shared templates and such)
  • compare the DOM trees for all documents on the same site (tree walking)
  • strip all redundant nodes (i.e. repeated navigational markup, ads and such)
  • try to identify similar nodes and strip them if possible
  • find the largest unique text blocks that are not found in the other DOMs on that website (i.e. the unique content)
  • add these as candidates for further processing

This approach seems promising because it would be fairly simple to implement yet still adaptive, even to complex Web 2.0 pages that make heavy use of templates, because it identifies similar HTML nodes across all pages on the same website.

This could probably be improved further by using a simple scoring system to keep track of DOM nodes that were previously identified as containing unique content, so that those nodes are prioritized on other pages.
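
A minimal sketch of the cross-page comparison idea, assuming Python with the requests and beautifulsoup4 packages; the block-level tag list, page limit, and timeouts are arbitrary illustration choices, not part of the original answer:

```python
# Sketch: text blocks that repeat across sibling pages of the same site
# are treated as template/navigation boilerplate; unique blocks are
# candidates for the main content.
from collections import Counter
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def block_texts(html):
    """Return normalized text for block-level elements on a page."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    blocks = []
    for el in soup.find_all(["p", "div", "td", "li", "article", "section"]):
        text = " ".join(el.get_text(" ", strip=True).split())
        if text:
            blocks.append(text)
    return blocks


def extract_main_text(target_url, max_pages=5):
    """Keep the largest text block on target_url that does not repeat
    on other pages of the same site."""
    html = requests.get(target_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    host = urlparse(target_url).netloc

    # Collect a handful of same-site links to compare against.
    sibling_urls = []
    for a in soup.find_all("a", href=True):
        url = urljoin(target_url, a["href"])
        if urlparse(url).netloc == host and url != target_url:
            sibling_urls.append(url)
        if len(sibling_urls) >= max_pages:
            break

    # Count how often each text block appears across the sibling pages.
    seen_elsewhere = Counter()
    for url in sibling_urls:
        try:
            seen_elsewhere.update(set(block_texts(requests.get(url, timeout=10).text)))
        except requests.RequestException:
            continue

    # Blocks unique to this page are content candidates; pick the longest.
    candidates = [t for t in block_texts(html) if seen_elsewhere[t] == 0]
    return max(candidates, key=len, default="")
```

A scoring version, as suggested above, could persist the XPath or tag path of nodes that produced unique text and boost those paths when processing further pages from the same site.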

answered by none


Sometimes there's a CSS media type defined for 'print'. Its intended use is for 'Click here to print this page' links, and people usually use it to strip away a lot of the fluff and leave only the meat of the information.

http://www.w3.org/TR/CSS2/media.html

I would try to apply that print stylesheet and then scrape whatever is left visible.
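
A rough sketch of that idea, again assuming requests and beautifulsoup4; the @media print parsing here is deliberately naive (regex-based), and a fuller version would also handle stylesheets linked with media="print", whose rules all apply when printing:

```python
# Sketch: find selectors hidden by @media print rules and drop the
# matching elements, then read whatever text remains.
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def hidden_in_print(css_text):
    """Yield selectors that an @media print block hides with display:none."""
    media_blocks = re.findall(
        r"@media[^{]*\bprint\b[^{]*\{((?:[^{}]*\{[^}]*\})*[^{}]*)\}",
        css_text, flags=re.S)
    for block in media_blocks:
        for selectors, body in re.findall(r"([^{}]+)\{([^}]*)\}", block):
            if "display" in body and "none" in body:
                for sel in selectors.split(","):
                    yield sel.strip()


def visible_when_printed(url):
    """Remove elements the site's print CSS hides, return the remaining text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Gather inline <style> blocks and linked stylesheets.
    css_chunks = [s.get_text() for s in soup.find_all("style")]
    for link in soup.find_all("link", rel="stylesheet", href=True):
        try:
            css_chunks.append(
                requests.get(urljoin(url, link["href"]), timeout=10).text)
        except requests.RequestException:
            continue

    # Drop every element a print rule hides.
    for css in css_chunks:
        for selector in hidden_in_print(css):
            try:
                for el in soup.select(selector):
                    el.decompose()
            except Exception:  # soupsieve may reject exotic selectors
                continue

    return soup.get_text(" ", strip=True)
```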

answered by Ian Jacobs