
Using Xpath and HtmlAgilityPack to find all elements with innertext containing a specific word or words

I am trying to build a simple search engine using HtmlAgilityPack and XPath with C# (.NET 4). I want to find every node containing a user-defined search word, but I can't seem to get the XPath right. For example:

<HTML>
 <BODY>
  <H1>Mr T for president</H1>
   <div>We believe the new president should be</div>
   <div>the awsome Mr T</div>
   <div>
    <H2>Mr T replies:</H2>
     <p>I pity the fool who doesn't vote</p>
     <p>for Mr T</p>
   </div>
  </BODY>
</HTML>

If the specified search word is "Mr T", I want the following nodes: <H1>, the second <div>, <H2>, and the second <p>. I have tried numerous variants of doc.DocumentNode.SelectNodes("//text()[contains(., "+ searchword +")]"); but I always seem to wind up with every single node in the entire DOM.

Any hints to point me in the right direction would be much appreciated.

user1161569 asked Jan 20 '12 22:01


2 Answers

Use:

//*[text()[contains(., 'Mr T')]]

This selects all elements in the XML document that have a text-node child which contains the string 'Mr T'.

This can also be written more concisely as:

//text()[contains(., 'Mr T')]/..

This selects the parent(s) of any text node that contains the string 'Mr T'.
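
To tie this back to the question's setup, here is a minimal sketch of running the first expression through HtmlAgilityPack. It assumes the HtmlAgilityPack NuGet package is referenced; the class name KeywordSearchDemo and the helper FindElementNames are hypothetical names chosen for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class KeywordSearchDemo
{
    // Hypothetical helper: returns the names of all elements that have a
    // direct text-node child containing the literal string 'Mr T'.
    public static List<string> FindElementNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[text()[contains(., 'Mr T')]]");

        var names = new List<string>();
        if (nodes == null)
        {
            return names;
        }

        foreach (HtmlNode node in nodes)
        {
            names.Add(node.Name);
        }
        return names;
    }

    public static void Main()
    {
        // The sample page from the question, as a verbatim string.
        const string html = @"<HTML>
 <BODY>
  <H1>Mr T for president</H1>
   <div>We believe the new president should be</div>
   <div>the awsome Mr T</div>
   <div>
    <H2>Mr T replies:</H2>
     <p>I pity the fool who doesn't vote</p>
     <p>for Mr T</p>
   </div>
  </BODY>
</HTML>";

        foreach (string name in FindElementNames(html))
        {
            Console.WriteLine(name);
        }
    }
}
```

Because the predicate tests only an element's own text-node children, the outer <div> that merely wraps the <H2> and <p> elements should not match, even though its descendants contain the phrase. HtmlAgilityPack reports element names in lower case, so the names printed here may differ in case from the markup.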

Dimitre Novatchev answered Oct 01 '22 15:10


In XPath, to find elements whose text contains a specific keyword, use the following pattern (where keyword is the word you want to search for):

//*[text()[contains(., 'keyword')]]

Use the same pattern in C#, where keyword is a string variable holding the search term:

doc.DocumentNode.SelectNodes("//*[text()[contains(., '" + keyword + "')]]");
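
The quotes around the concatenated term matter. Without them, a single-word term is most likely parsed as an element name rather than a string; that usually evaluates to an empty string, and contains(., '') is true for every node, which would explain getting the whole DOM back. Plain concatenation also breaks if the term itself contains an apostrophe (such as "doesn't" in the sample page), because XPath 1.0 string literals have no escape character. The following is a minimal sketch of one common workaround, which picks the quote style that fits and falls back to concat(). The names XPathQuoting, ToXPathLiteral and BuildContainsQuery are hypothetical helpers, not part of HtmlAgilityPack.

```csharp
using System;

public static class XPathQuoting
{
    // Hypothetical helper: turns an arbitrary string into an XPath 1.0
    // string expression that evaluates to exactly that string.
    public static string ToXPathLiteral(string value)
    {
        // No apostrophe: single quotes are safe.
        if (!value.Contains("'"))
        {
            return "'" + value + "'";
        }

        // Apostrophe but no double quote: double quotes are safe.
        if (!value.Contains("\""))
        {
            return "\"" + value + "\"";
        }

        // Both quote characters present: split on the apostrophe and glue
        // the pieces back together with concat(), quoting each apostrophe
        // in double quotes.
        string[] parts = value.Split('\'');
        return "concat('" + string.Join("', \"'\", '", parts) + "')";
    }

    // Hypothetical helper: builds the query from the answer above for any keyword.
    public static string BuildContainsQuery(string keyword)
    {
        return "//*[text()[contains(., " + ToXPathLiteral(keyword) + ")]]";
    }

    public static void Main()
    {
        Console.WriteLine(BuildContainsQuery("Mr T"));
        // prints: //*[text()[contains(., 'Mr T')]]

        Console.WriteLine(BuildContainsQuery("doesn't"));
        // prints: //*[text()[contains(., "doesn't")]]
    }
}
```

The resulting string can then be passed to doc.DocumentNode.SelectNodes(...) in place of the hand-concatenated query.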
Eugene Liu answered Oct 01 '22 15:10