Get text that lies after pattern without class or id

Question

I am using the HtmlAgilityPack.

It is an excellent tool for parsing data; however, every time I have used it I have had either a class or an id to aim at, e.g.:

string example = doc.DocumentNode.SelectSingleNode("//div[@class='target']").InnerText.Trim();

However, I have come across a piece of text that isn't nested inside any element with a class or id I can aim at. For example:

<p>Example Header</p>: This is the text I want!<br>

The example does, however, always follow the same pattern, i.e. the text will always come after </p>: and before <br>.

I can extract the text using a regular expression, but I would prefer to use the Agility Pack, as the rest of the code follows suit. Is there a means of doing this using the pack?

har07 · Accepted Answer

This XPath works for me:

var html = @"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
var doc = new HtmlDocument();

doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/div[@class='target']/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);

/text() selects all text nodes that are direct children of the <div>
[(normalize-space())] excludes all text nodes that contain only whitespace (two newline-only text nodes are excluded in this HTML sample: one before <p> and the other after <br>)
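To see what the predicate filters out, it may help to list every direct text-node child of the <div> without it. This is only a sketch built on the sample HTML above; the class name TextNodeDemo is made up for illustration, and the exact whitespace nodes you see depend on how the parser splits the input.

```csharp
using System;
using HtmlAgilityPack;

class TextNodeDemo
{
    static void Main()
    {
        var html = @"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // No predicate here: every direct text-node child of the <div> is returned,
        // including the ones that hold nothing but line breaks.
        var nodes = doc.DocumentNode.SelectNodes("/div[@class='target']/text()");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes == null)
        {
            Console.WriteLine("No text nodes found.");
            return;
        }

        foreach (var node in nodes)
        {
            // Flag the nodes that [(normalize-space())] would drop.
            bool blank = string.IsNullOrWhiteSpace(node.InnerText);
            Console.WriteLine($"[{node.InnerText.Trim()}] whitespace-only: {blank}");
        }
    }
}
```

The nodes reported as whitespace-only are the ones the normalize-space() predicate removes, which is why SelectSingleNode with the predicate lands on the wanted text.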

Result: (screenshot not preserved)

UPDATE I :

Every element must have a parent, like the <div> in the example above. If it is the root node you're talking about, the same approach should still work. The key is to use the /text() XPath step to get the text node:

var html = @"<p>Example Header</p>: This is the text I want!<br>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);

UPDATE II :

OK, so you want to select the text node after the <p> element and before the <br> element. You can use this XPath then:

var result = 
        doc.DocumentNode
           .SelectSingleNode("/text()[following-sibling::br and preceding-sibling::p]")
           .OuterHtml;
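A fuller sketch of that last XPath, assuming the same one-line sample HTML as in Update I. The null check and the trimming of the leading colon are additions for illustration and are not part of the original answer; the class name ExtractAfterHeader is hypothetical.

```csharp
using System;
using HtmlAgilityPack;

class ExtractAfterHeader
{
    static void Main()
    {
        var html = @"<p>Example Header</p>: This is the text I want!<br>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Text node that has a <p> sibling before it and a <br> sibling after it.
        var node = doc.DocumentNode.SelectSingleNode(
            "/text()[following-sibling::br and preceding-sibling::p]");

        // SelectSingleNode returns null when nothing matches, so guard before use.
        if (node == null)
        {
            Console.WriteLine("No matching text node found.");
            return;
        }

        // Decode HTML entities, then strip the leading ':' and surrounding whitespace.
        string text = HtmlEntity.DeEntitize(node.InnerText).Trim().TrimStart(':').Trim();
        Console.WriteLine(text);
    }
}
```

The text node itself still begins with the colon that follows </p>, so the trimming step is what leaves only the sentence. If the markup sits inside a container element, the XPath would need that container's path in front of /text().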

Tags:

c#

regex

parsing

html-agility-pack

Ebikeneser