Input
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
Output
foo
bar
baz
I know of htmldoc.DocumentNode.InnerText, but it returns all of the text joined into one string (foo bar baz). I want to get each piece of text separately, not all at once.
XPath is your friend :)
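HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
// "//text()" selects every text node in the document.
// Note: by default SelectNodes returns null (not an empty collection) when nothing matches.
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    Console.WriteLine("text=" + node.InnerText);
}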
I needed a solution that extracts all text but discards the content of script and style tags. I could not find one anywhere, so I came up with the following, which suits my own needs:
// Requires: using System.Linq; using HtmlAgilityPack;
// Keep only text nodes whose direct parent is neither <script> nor <style>.
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where(n =>
    n.NodeType == HtmlNodeType.Text &&
    n.ParentNode.Name != "script" &&
    n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes)
{
    Console.WriteLine(node.InnerText);
}
var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;
For example, this HTML content:
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
will produce the following output:
foo bar baz
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
    if (!node.HasChildNodes)
    {
        string text = node.InnerText;
        if (!string.IsNullOrEmpty(text))
            sb.AppendLine(text.Trim());
    }
}
This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than DescendantNodesAndSelf for optimal performance.
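One possible alternative, as a minimal sketch rather than a definitive answer: instead of testing HasChildNodes, filter on the node type, as another answer here does. The TextExtractor class and ExtractTextLines helper are hypothetical names made up for this example. It still walks every descendant, so it is not necessarily faster, but it skips leaf nodes that are not text (such as comments) and drops whitespace-only entries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TextExtractor
{
    // Hypothetical helper: returns the trimmed content of every
    // non-blank text node, in document order.
    public static List<string> ExtractTextLines(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text) // text nodes only
            .Select(n => n.InnerText.Trim())
            .Where(text => text.Length > 0)              // drop whitespace-only nodes
            .ToList();
    }

    public static void Main()
    {
        string html = "<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>";
        foreach (string line in ExtractTextLines(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

For the HTML from the question, this should print foo, bar and baz on separate lines.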