Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to extract Article Text contents from HTML page like Pocket (Read It Later) or Readability? [closed]

I am looking for some open source framework or algorithm to extract article text contents from any HTML page by cleaning the HTML code, removing garbage stuff, similar to what Pocket (aka Read It Later) software does.

Pocket official webpage: http://getpocket.com/

This question is already available under link: How to extract text contents from html like Read it later or InstaPaper Iphone app? but my requirement is bit different. I want to clean the HTML and extract main contents with images by preserving the font and style (CSS).

like image 999
Furqan Safdar Avatar asked Sep 02 '12 19:09

Furqan Safdar


2 Answers

I would recommend NReadability, together with HtmlAgilityPack

Main text is always in div with id readInner after NReadability transcoded the page.

//** replace this with any url **
string url = "http://www.bbc.co.uk/news/world-asia-19457334";

var t = new NReadability.NReadabilityWebTranscoder();
bool b;
string page = t.Transcode(url, out b);

if (b)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(page);

    var title = doc.DocumentNode.SelectSingleNode("//title").InnerText;
    var imgUrl = doc.DocumentNode.SelectSingleNode("//meta[@property='og:image']").Attributes["content"].Value;
    var mainText = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']").InnerText;
}
like image 194
L.B Avatar answered Oct 19 '22 11:10

L.B


Use the HTML Agilty Pack - it is an open source HTML parser for .NET.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

You can use this to query HTML and extract whatever data you wish.

like image 34
Oded Avatar answered Oct 19 '22 10:10

Oded