I tried HtmlAgilityPack with the following code, but it does not capture the text from HTML lists:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlStr);
HtmlNode node = doc.DocumentNode;
return node.InnerText;
Here is the HTML that fails:
<as html>
<p>This line is picked up <b>correctly</b>. List items hasn't...</p>
<p><ul>
<li>List Item 1</li>
<li>List Item 2</li>
<li>List Item 3</li>
<li>List Item 4</li>
</ul></p>
</as html>
Click and drag to select the text on the Web page you want to extract and press “Ctrl-C” to copy the text. Open a text editor or document program and press “Ctrl-V” to paste the text from the Web page into the text file or document window. Save the text file or document to your computer.
Because you need to walk over the node tree and concatenate the InnerText of all the nodes in some way.
The following piece of code works for me:
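string StripHTML(string htmlStr)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlStr);
    var root = doc.DocumentNode;
    // StringBuilder avoids reallocating the string on every append.
    var sb = new System.Text.StringBuilder();
    foreach (var node in root.DescendantNodesAndSelf())
    {
        // Only leaf nodes are read, so text is not repeated once per ancestor.
        if (!node.HasChildNodes)
        {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
                sb.Append(text.Trim()).Append(' ');
        }
    }
    return sb.ToString().Trim();
}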
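As a possible alternative to walking every node by hand, the same idea can be sketched with an XPath query that selects only the text nodes. This is not from the answer above. The class name HtmlText and the method name StripHtmlViaXPath are made up for illustration, and the sketch assumes the HtmlAgilityPack NuGet package is referenced. It also decodes entities such as &amp;amp; with HtmlEntity.DeEntitize, which the original code does not do.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class HtmlText
{
    // Hypothetical helper: gathers the text of every text node in the document.
    public static string StripHtmlViaXPath(string htmlStr)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlStr);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
            return string.Empty;

        // Decode entities, trim each piece, and drop whitespace-only nodes.
        var parts = textNodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        // Pieces are joined with a single space, like the loop-based version.
        return string.Join(" ", parts);
    }
}
```

Like the loop-based version, this puts a space between every piece of text, so a sentence split by inline tags such as <b> gets a space before the punctuation that follows the tag. The //text() query also matches text inside script and style elements, so those may need to be filtered out for real pages.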