
Instapaper-like algorithm

Tags:

html

Does anyone know of an algorithm that extracts the content from a webpage, like Instapaper does?

Joey asked Nov 26 '10 07:11


2 Answers

There are two steps to what Instapaper does:

  1. Find the main content block on the page (excluding headers, footers, menus, etc.)
  2. Extract and format the text from this content block

To find the content block (typically an HTML block element, such as a div, that contains the page's main text), Instapaper uses an algorithm much like the one used by Readability. You can read the source of readability.js to see the details, but at its core it looks for the area of the page with the highest ratio of text to links. It also feeds a few other simple scoring metrics into its heuristics (off the top of my head: things like the number of commas in the text, the number of paragraph elements, etc.).
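
As a rough illustration of that kind of scoring, here is a minimal Python sketch using only the standard library. It is not the actual Readability or Instapaper code: the names (`Node`, `TreeBuilder`, `find_main_block`), the 25-character minimum, and the point weights are all assumptions chosen for the example. Each paragraph earns points for its commas and length, the points are credited to its parent and (at half weight) its grandparent, and each candidate's total is then scaled down by its link density.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not become the "current" node.
VOID = {"br", "img", "hr", "input", "meta", "link", "area", "base", "col", "wbr"}
IGNORED_TEXT = {"script", "style"}


class Node:
    def __init__(self, tag, attrs, parent):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []  # mix of Node objects and plain text strings
        self.score = 0.0

    def text(self):
        if self.tag in IGNORED_TEXT:
            return ""
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)

    def link_text_len(self):
        if self.tag == "a":
            return len(self.text())
        return sum(c.link_text_len() for c in self.children if isinstance(c, Node))


class TreeBuilder(HTMLParser):
    """Builds a very small, forgiving element tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [], None)
        self.current = self.root
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.nodes.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def find_main_block(html):
    """Return the Node that looks most like the main content, or None."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()

    for p in builder.nodes:
        if p.tag != "p":
            continue
        text = p.text().strip()
        if len(text) < 25:  # assumed threshold: skip very short paragraphs
            continue
        points = 1 + text.count(",") + min(len(text) // 100, 3)
        if p.parent is not None:
            p.parent.score += points
            if p.parent.parent is not None:
                p.parent.parent.score += points / 2

    best, best_score = None, 0.0
    for node in builder.nodes:
        if node.score <= 0:
            continue
        total = len(node.text())
        density = node.link_text_len() / total if total else 0.0
        final = node.score * (1 - density)  # link-heavy blocks are penalised
        if final > best_score:
            best, best_score = node, final
    return best
```

Crediting the parent rather than the paragraph itself is what keeps an outer wrapper (such as body) from winning automatically: it only receives half credit, and it also inherits the link density of any menus it contains.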

Once you have identified the root node that holds the relevant content, you need to format it. You could simply pull that node out of the source document and insert it into your own, but in practice you will probably want to remove the existing styles and apply your own for a consistent look and feel. If you want clean text-only output, you can use Jericho's Renderer.
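
A minimal sketch of the "remove existing styles" step, again in plain Python with the standard library. The lists of attributes and elements to drop are assumptions for the example, not what Instapaper actually strips, and `strip_styles` is a hypothetical helper name. It re-serialises an HTML fragment while dropping presentational attributes, inline event handlers, and script/style elements, so you can apply your own stylesheet afterwards.

```python
from html import escape
from html.parser import HTMLParser

# Assumed lists: adjust to whatever your own stylesheet needs.
DROP_ATTRS = {"style", "class", "id", "align", "bgcolor", "width", "height"}
DROP_ELEMENTS = {"script", "style"}


class StyleStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_ELEMENTS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name not in DROP_ATTRS and not name.startswith("on")
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit the start tag only.
        if tag not in DROP_ELEMENTS:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in DROP_ELEMENTS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_styles(fragment):
    """Return the fragment with presentational markup removed."""
    parser = StyleStripper()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

Usage would look like `clean = strip_styles(content_html)`, where `content_html` is the serialised content block you extracted in the first step.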

update1: I should also mention something else Instapaper does: it follows the article's pagination links (the "next" or "1", "2", "3" links) to their conclusion, so that a piece spanning many pages in the original is rendered as a single document.
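
Pagination following can be sketched like this. It is only an illustration under some assumptions: a "next" link is recognised either by rel="next" or by a link label of "next" / "next page", `fetch` is a caller-supplied function that returns the HTML for a URL, and `NextLinkFinder`, `collect_pages`, and `max_pages` are hypothetical names. Real sites vary a lot, so the matching rules would need tuning.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first link that looks like a "next page" link."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._href = None       # href of the <a> currently open, if any
        self._rel_next = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            a = dict(attrs)
            self._href = a.get("href")
            self._rel_next = "next" in (a.get("rel") or "").lower().split()
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).lower().rstrip(" \u00bb\u203a>").strip()
            looks_next = self._rel_next or label in {"next", "next page"}
            if self.next_href is None and looks_next:
                self.next_href = self._href
            self._href = None


def collect_pages(start_url, fetch, max_pages=20):
    """Follow "next" links from start_url and return the HTML of each page in order."""
    pages, seen, url = [], set(), start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)  # guards against pagination loops
        html = fetch(url)
        pages.append(html)
        finder = NextLinkFinder()
        finder.feed(html)
        finder.close()
        url = urljoin(url, finder.next_href) if finder.next_href else None
    return pages
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP client; each returned page can then be run through content extraction and the results concatenated into one document.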

update2: I recently came across this comparison of text extraction algorithms.

Joel answered Oct 14 '22 16:10


There is an open source application that parses the text of an article out of any webpage:

https://github.com/jiminoc/goose/wiki

It should do the trick.

James answered Oct 14 '22 14:10