
XPath Expression: Select elements between A HREF="expr" tags

I haven't found an explicit way to select all nodes that exist between two anchors (<a></a> tag pairs) in an HTML file.

The first anchor has the following format:

<a href="file://START..."></a>

Second anchor:

<a href="file://END..."></a>

I've verified that both can be selected using starts-with (note that I'm using HTML Agility Pack):

HtmlNode n0 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://START')]");
HtmlNode n1 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://END')]");

With this in mind, and with my amateurish XPath skills, I wrote the following expression to get all tags between the two anchors:

html.DocumentNode.SelectNodes("//*[not(following-sibling::a[starts-with(@href,'file://START0')]) and not (preceding-sibling::a[starts-with(@href,'file://END0')])]");

This runs without errors, but it selects the entire HTML document!

For example, in the following HTML fragment I need to:

<html>
...

<a href="file://START0"></a>
<p>First nodes</p>
<p>First nodes
    <span>X</span>
</p>
<p>First nodes</p>
<a href="file://END0"></a>

...
</html>

remove both anchors and the three P elements (including, of course, the inner SPAN).

Any way to do this?

I don't know if XPath 2.0 offers better ways to achieve this.

EDIT (special case!)

I should also handle the case where:

"Select tags between X and X', where X is <p><a href="file://..."></a></p>"

So instead of:

<a href="file://START..."></a>
<!-- xhtml to be extracted -->
<a href="file://END..."></a>

I should also handle:

<p>
  <a href="file://START..."></a>
</p>
<!-- xhtml to be extracted -->

<p>
  <a href="file://END..."></a>
</p>

Thank you very much, again.

Hernán asked Jul 02 '11 00:07


2 Answers

Use this XPath 1.0 expression:

//a[starts-with(@href,'file://START')]/following-sibling::node()
     [count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
     =
      count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
     ]

Or, use this XPath 2.0 expression:

    //a[starts-with(@href,'file://START')]/following-sibling::node()
  intersect
    //a[starts-with(@href,'file://END')]/preceding-sibling::node()

The XPath 2.0 expression uses the XPath 2.0 intersect operator.

The XPath 1.0 expression uses the Kayessian (after @Michael Kay) formula for the intersection of two node-sets:

$ns1[count(.|$ns2) = count($ns2)]
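
To see why the formula works, here is a minimal Python sketch that models node-sets as plain lists of labels. The function name and the labels are made up for illustration and are not part of any XPath library. The idea it mirrors: adding a node to $ns2 leaves the size of the union unchanged only when that node is already a member of $ns2.

```python
def kayessian_intersection(ns1, ns2):
    """Return the members of ns1 that also belong to ns2, in ns1 order.

    Mirrors $ns1[count(.|$ns2) = count($ns2)]: the union of a node with
    ns2 has the same size as ns2 only if the node is already in ns2.
    """
    ns2_set = set(ns2)
    return [node for node in ns1 if len({node} | ns2_set) == len(ns2_set)]


# Hypothetical labels standing in for the sibling nodes of the sample document.
following_start = ["p1", "p2", "p3", "a_end", "div_after"]
preceding_end = ["div_before", "a_start", "p1", "p2", "p3"]

between = kayessian_intersection(following_start, preceding_end)
print(between)  # ['p1', 'p2', 'p3']
```

In real XPath 1.0 the comparison is by node identity rather than by label, but the counting argument is the same.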

Verification with XSLT:

This XSLT 1.0 transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "    //a[starts-with(@href,'file://START')]/following-sibling::node()
         [count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
         =
          count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
         ]
  "/>
 </xsl:template>
</xsl:stylesheet>

when applied on the provided XML document:

<html>...
    <a href="file://START0"></a>
    <p>First nodes</p>
    <p>First nodes    
        <span>X</span>
    </p>
    <p>First nodes</p>
    <a href="file://END0"></a>...
</html>

produces the wanted, correct result:

<p>First nodes</p>
<p>First nodes    
        <span>X</span>
</p>
<p>First nodes</p>

This XSLT 2.0 transformation:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  " //a[starts-with(@href,'file://START')]/following-sibling::node()
   intersect
    //a[starts-with(@href,'file://END')]/preceding-sibling::node()
  "/>
 </xsl:template>
</xsl:stylesheet>

when applied on the same XML document (above) again produces exactly the wanted result.
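
For readers without an XPath engine at hand, the same selection can be approximated by walking the siblings directly. The following is only a sketch using Python's standard xml.etree.ElementTree, under these assumptions: the input is well-formed XML (real-world HTML usually needs a tolerant parser), both anchors share one parent, and only element nodes are collected (text nodes are ignored). The helper names is_marker and elements_between are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<html>
<a href="file://START0"></a>
<p>First nodes</p>
<p>First nodes
    <span>X</span>
</p>
<p>First nodes</p>
<a href="file://END0"></a>
</html>"""


def is_marker(node, prefix):
    """True if node is an <a> whose href starts with the given prefix."""
    return node.tag == "a" and node.get("href", "").startswith(prefix)


def elements_between(parent, start_prefix, end_prefix):
    """Collect child elements after the START anchor and before the next END anchor."""
    selected, inside = [], False
    for child in parent:
        if inside and is_marker(child, end_prefix):
            return selected
        if inside:
            selected.append(child)
        elif is_marker(child, start_prefix):
            inside = True
    return []  # no END anchor found after START, so nothing lies "between"


root = ET.fromstring(SAMPLE)
between = elements_between(root, "file://START", "file://END")
print([el.tag for el in between])  # ['p', 'p', 'p']
```

To delete the selection, as the question asks, each collected element can then be passed to the parent's remove() method, and the two anchors can be removed the same way.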

Dimitre Novatchev answered Oct 17 '22 15:10


I've added a special case that I should handle

To handle this special case you can work the same way, that is, use the Kayessian formula (and use XPath Visualizer as well ;-)). The intersecting node-sets change as follows:

Intersecting node-set C

    "//p[.//a[starts-with(@href,'file://START')]]
         /following-sibling::node()"

All following siblings of the p containing a START anchor.

Intersecting node-set D

"./following-sibling::p[.//a[starts-with(@href,'file://END')]]
    /preceding-sibling::node()"

All preceding siblings of a p that contains an END anchor and is a following sibling of the current node.


Now you can perform the intersection as:

C ∩ D

That is

    "//p[.//a[starts-with(@href,'file://START')]]
            /following-sibling::node()[
            count(.| ./following-sibling::p
                     [.//a[starts-with(@href,'file://END')]]
                       /preceding-sibling::node())
            =
            count(./following-sibling::p
                   [.//a[starts-with(@href,'file://END')]]
                     /preceding-sibling::node())
            ]"

If you need to manage both situations, you can proceed with the union of the intersecting node-sets as

(A ∩ B) ∪ (C ∩ D)

Where:

  • the XPath union operator | must be used;
  • the node-sets A and B are already shown in @Dimitre's answer;
  • the node-sets C and D are those shown in my answer.
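
As a rough illustration of handling both situations at once, here is a Python sketch using only the standard xml.etree.ElementTree. It is not a translation of the XPath union above but a simplified sibling walk, and it assumes well-formed input, markers that are direct children of one parent, and element nodes only. Unlike the XPath, it will also pair a bare START with a wrapped END. The sample document and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
<p><a href="file://START0"></a></p>
<div>wrapped case</div>
<p><a href="file://END0"></a></p>
<a href="file://START1"></a>
<h2>direct case</h2>
<a href="file://END1"></a>
</body>"""


def href_matches(node, prefix):
    """True if node is an <a> whose href starts with prefix."""
    return node.tag == "a" and node.get("href", "").startswith(prefix)


def is_marker(node, prefix):
    """Match a bare <a> marker, or a <p> that wraps such an <a>."""
    if href_matches(node, prefix):
        return True
    return node.tag == "p" and any(href_matches(a, prefix) for a in node.iter("a"))


def elements_between_markers(parent, start_prefix="file://START", end_prefix="file://END"):
    """Collect elements enclosed by START/END markers, bare or wrapped in <p>."""
    selected, pending, inside = [], [], False
    for child in parent:
        if inside and is_marker(child, end_prefix):
            selected.extend(pending)    # commit the run closed by an END marker
            pending, inside = [], False
        elif is_marker(child, start_prefix):
            pending, inside = [], True  # (re)start collecting after a START marker
        elif inside:
            pending.append(child)
    return selected  # a START without a closing END contributes nothing


body = ET.fromstring(SAMPLE)
result = elements_between_markers(body)
print([el.tag for el in result])  # ['div', 'h2']
```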
Emiliano Poggi answered Oct 17 '22 16:10