Does anyone know of an algorithm that extracts content from a webpage, like Instapaper?
There are two steps to what Instapaper does:
To find the content block (typically some HTML block element, like a div, containing the key page text content), Instapaper uses an algorithm much like the one used by Readability. You can look at the source of readability.js to see what's going on, but at its core it tries to find the area on the page with the highest text-to-link ratio. It has some other simple scoring metrics too (off the top of my head, things like the ratio of text to commas, paragraph elements, etc.) that go into the heuristics.
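To make the text/link idea concrete, here is a rough Python sketch using only the standard library. It is not Readability's actual scoring: the set of candidate tags, the comma bonus of 10, and the names `BlockScorer`, `score` and `find_content_block` are all assumptions made for illustration, and it expects reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements treated as candidate content containers (an assumption of this sketch).
BLOCK_TAGS = {"div", "article", "section", "td"}


class BlockScorer(HTMLParser):
    """Count text, link text and commas for each candidate block element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # candidate blocks that are currently open
        self.blocks = []     # candidate blocks that have been closed
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append(
                {"tag": tag, "attrs": dict(attrs), "text": 0, "link": 0, "commas": 0}
            )
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        block = self.stack[-1]  # credit only the innermost open block
        block["text"] += len(text)
        block["commas"] += text.count(",")
        if self.link_depth:
            block["link"] += len(text)


def score(block):
    """Reward lots of non-link text, with a small bonus per comma."""
    if block["text"] == 0:
        return 0.0
    link_density = block["link"] / block["text"]
    return block["text"] * (1.0 - link_density) + 10 * block["commas"]


def find_content_block(html):
    """Return the counters of the best-scoring block, or None if there is none."""
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    candidates = scorer.blocks + scorer.stack  # include blocks left unclosed
    return max(candidates, key=score, default=None)
```

Calling `find_content_block(page_html)["attrs"]` gives the attributes of the winning element, which is enough to locate it again in the document. A navigation bar made entirely of links scores zero here, while a div full of prose scores high. A real implementation would also skip script and style content and spread some of each paragraph's score to its ancestors.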
Once you have identified the root node element with the relevant content, you'll need to format it. If you want, you can just pull the node containing the text out of the source document and insert it into yours, but in practice you'll probably want to remove the existing styles and apply your own for a standard look and feel. If you want to output nice text-only content, you can use Jericho's Renderer.
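As an illustration of the "remove existing styles" part, here is a small Python sketch that re-emits an HTML fragment without presentational attributes or inline script/style elements. It is a stand-in written with the standard library, not what Instapaper or Jericho actually do; the attribute list and the names `StyleStripper` and `strip_styles` are assumptions.

```python
from html import escape
from html.parser import HTMLParser

DROP_ELEMENTS = {"script", "style"}  # removed together with their content
DROP_ATTRS = {"style", "class", "id", "align", "bgcolor", "width", "height"}


class StyleStripper(HTMLParser):
    """Re-emit HTML, dropping styling attributes and script/style elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_ELEMENTS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = []
        for name, value in attrs:
            # Drop presentational attributes and inline event handlers.
            if name in DROP_ATTRS or name.startswith("on"):
                continue
            if value is None:
                kept.append(f" {name}")  # boolean attribute
            else:
                kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit the start tag only.
        if tag not in DROP_ELEMENTS:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in DROP_ELEMENTS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_styles(fragment):
    """Return the fragment with styling removed, ready for your own stylesheet."""
    stripper = StyleStripper()
    stripper.feed(fragment)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.out)
```

The cleaned fragment keeps structural markup and attributes such as `href`, so you can wrap it in your own template and apply one stylesheet to every article.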
update1: I should also mention something else Instapaper does, which is to follow the 'pagination' links (the "next" or "1", "2", "3" links) of the article to their conclusion, so that a piece that may span many pages in the original will be rendered to you as a single document.
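Following pagination could look roughly like the Python sketch below. How Instapaper really detects these links is not something I know; here a link counts as "next" if it has `rel="next"` or its text looks like "Next", which is an assumption. The `fetch` parameter is a hypothetical callable you supply that takes a URL and returns that page's HTML, and `find_next_url` and `collect_pages` are illustrative names.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Link text such as "Next", "next page" or "Next »" (an assumed heuristic).
NEXT_TEXT = re.compile(r"^next(\s+page)?\W*$", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect (href, rel, text) for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the link being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attributes = dict(attrs)
            if attributes.get("href"):
                rel = (attributes.get("rel") or "").lower()
                self._current = [attributes["href"], rel, ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(tuple(self._current))
            self._current = None


def find_next_url(html, base_url):
    """Return the absolute URL of the first next-page link, or None."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    for href, rel, text in collector.links:
        if "next" in rel.split() or NEXT_TEXT.match(text.strip()):
            return urljoin(base_url, href)
    return None


def collect_pages(start_url, fetch, max_pages=20):
    """Fetch the article page by page until no next link is found."""
    pages, seen, url = [], set(), start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)  # guard against pagination loops
        html = fetch(url)
        pages.append(html)
        url = find_next_url(html, url)
    return pages
```

Each page returned by `collect_pages` would then go through content extraction separately, and the extracted blocks get concatenated into the single document. The `seen` set and `max_pages` cap are there so a site whose "next" links loop back cannot keep the crawler running forever.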
update2: I recently came across this comparison of text extraction algorithms.
There is an open source application that parses the text of an article out of any webpage:
https://github.com/jiminoc/goose/wiki
It should do the trick.