Get text that lies after pattern without class or id

I am using the HtmlAgilityPack.

It is an excellent tool for parsing data; however, every time I have used it, I have had either a class or an id to target, e.g.:

string example = doc.DocumentNode.SelectSingleNode("//div[@class='target']").InnerText.Trim();

However, I have come across a piece of text that isn't nested in any element with a class or id I can target, e.g.:

<p>Example Header</p>: This is the text I want!<br>

However, the example does always follow the same pattern, i.e. the text will always be after </p>: and before <br>.

I can extract the text using a regular expression, but I would prefer to use the Agility Pack, as the rest of the code does. Is there a way of doing this with the pack?

Ebikeneser asked Dec 03 '25 01:12


1 Answer

This XPath works for me:

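// Requires: using System; using HtmlAgilityPack;
var html = @"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
var doc = new HtmlDocument();

doc.LoadHtml(html);
// Select the first non-whitespace text node directly under the <div>.
// SelectSingleNode returns null if nothing matches, so check before use in real code.
var result = doc.DocumentNode.SelectSingleNode("/div[@class='target']/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);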
  • /text() selects all text nodes that are direct children of the <div>
  • [(normalize-space())] excludes text nodes that contain only whitespace (two newline-only nodes are excluded in this HTML sample: one before <p> and the other after <br>)
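
To see what the two points above mean in practice, here is a minimal sketch that counts the text nodes under the <div> with and without the normalize-space() predicate. The names TextNodeCount and Count are made up for illustration and are not part of the library. SelectNodes returns null when nothing matches, so the helper treats that case as zero.

```csharp
using System;
using HtmlAgilityPack;

static class TextNodeCount
{
    // Hypothetical helper: number of nodes matched by an XPath expression.
    public static int Count(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        var html = @"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";

        // Every text node that is a direct child of the <div>.
        Console.WriteLine(Count(html, "/div[@class='target']/text()"));

        // Only the text nodes that are not pure whitespace.
        Console.WriteLine(Count(html, "/div[@class='target']/text()[normalize-space()]"));
    }
}
```

Going by the explanation above, the first count should include the two newline-only nodes, while the second should leave only the node that holds the wanted text.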


UPDATE I:

Every element must have a parent, like the <div> in the example above. If the text sits directly under the root node, the same approach should still work. The key is to use the /text() XPath step to get the text node:

var html = @"<p>Example Header</p>: This is the text I want!<br>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);

UPDATE II:

OK, so you want to select the text node after the <p> element and before the <br> element. You can use this XPath then:

var result = 
        doc.DocumentNode
           .SelectSingleNode("/text()[following-sibling::br and preceding-sibling::p]")
           .OuterHtml;
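
Note that the selected text node still begins with the ": " that follows </p>. Below is a minimal sketch, assuming you want only the sentence itself and that the text sits directly under the root node as in the question. TextAfterHeader and Extract are hypothetical names, and trimming the leading colon is an assumption about the desired output, not something the library does for you.

```csharp
using System;
using HtmlAgilityPack;

static class TextAfterHeader
{
    // Hypothetical helper: returns the text between <p>...</p> and <br>,
    // without the leading ": ", or null when no such text node exists.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var node = doc.DocumentNode.SelectSingleNode(
            "/text()[following-sibling::br and preceding-sibling::p]");

        // SelectSingleNode returns null when nothing matches.
        if (node == null)
        {
            return null;
        }

        // Drop the colon and spaces left over after </p>, then any trailing whitespace.
        return node.InnerText.TrimStart(':', ' ').Trim();
    }

    static void Main()
    {
        var html = @"<p>Example Header</p>: This is the text I want!<br>";
        Console.WriteLine(Extract(html) ?? "(no match)");
    }
}
```

The null check matters here because calling .OuterHtml or .InnerText directly on a missing node throws a NullReferenceException.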
har07 answered Dec 05 '25 15:12



