
How can I extract text visible on a page from its HTML source?

Tags:

html

c#

I tried HtmlAgilityPack with the following code, but it does not capture the text inside HTML lists:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlStr);
HtmlNode node = doc.DocumentNode;
return node.InnerText;

Here is the HTML input for which it fails:

<as html>
<p>This line is picked up <b>correctly</b>.  List items hasn't...</p>
<p><ul>
<li>List Item 1</li>
<li>List Item 2</li>
<li>List Item 3</li> 
<li>List Item 4</li>
</ul></p>
</as html>
Luke G asked Feb 05 '12 22:02


2 Answers

Because you need to walk the tree and concatenate, in some way, the InnerText of all the nodes.

Svisstack answered Oct 20 '22 00:10


The following piece of code works for me:

string StripHTML(string htmlStr)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlStr);
    var root = doc.DocumentNode;
    string s = "";
    // Visit every node and take text only from leaf nodes, so each piece of text is added once.
    foreach (var node in root.DescendantNodesAndSelf())
    {
        if (!node.HasChildNodes)
        {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
            {
                s += text.Trim() + " ";
            }
        }
    }
    return s.Trim();
}
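
A possible variant of the same tree walk, as a rough sketch rather than a definitive implementation: it keeps only text nodes (so comments are not collected), skips text whose parent is a script or style element, and decodes entities such as &amp;. The names VisibleText, Extract and Hidden are made up for illustration, and the set of skipped elements is an assumption that may need extending.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class VisibleText
{
    // Hypothetical list of elements whose text is not rendered; extend as needed.
    static readonly HashSet<string> Hidden =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "style" };

    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var parts = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.DescendantsAndSelf())
        {
            // Only text nodes are collected; elements and comments are skipped.
            if (node.NodeType != HtmlNodeType.Text) continue;

            // Skip text that sits directly inside a script or style element.
            if (node.ParentNode != null && Hidden.Contains(node.ParentNode.Name)) continue;

            // Decode entities and drop whitespace-only nodes.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0) parts.Add(text);
        }

        // Join the pieces with single spaces.
        return string.Join(" ", parts);
    }
}
```

Calling VisibleText.Extract(htmlStr) on the list markup from the question should return the list item text along with the paragraph text. Collecting the pieces in a list and joining once also avoids repeated string concatenation inside the loop.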
Luke G answered Oct 19 '22 22:10