I haven't found an explicit way to select all nodes that lie between two anchors (<a></a> tag pairs) in an HTML file.
The first anchor has the following format:
<a href="file://START..."></a>
Second anchor:
<a href="file://END..."></a>
I've verified that both can be selected using starts-with (note that I'm using HTML Agility Pack):
HtmlNode n0 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://START')]");
HtmlNode n1 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://END')]");
With this in mind, and with my amateurish XPath skills, I wrote the following expression to get all tags between the two anchors:
html.DocumentNode.SelectNodes("//*[not(following-sibling::a[starts-with(@href,'file://START0')]) and not (preceding-sibling::a[starts-with(@href,'file://END0')])]");
This runs, but it selects the whole HTML document!
For example, for the following HTML fragment I need to:
<html>
...
<a href="file://START0"></a>
<p>First nodes</p>
<p>First nodes
<span>X</span>
</p>
<p>First nodes</p>
<a href="file://END0"></a>
...
</html>
remove both anchors and the three P elements (including, of course, the inner SPAN).
Any way to do this?
I don't know if XPath 2.0 offers better ways to achieve this.
EDIT (special case!)
I should also handle the case where:
"Select tags between X and X', where X is <p><a href="file://..."></a></p>"
So instead of:
<a href="file://START..."></a>
<!-- xhtml to be extracted -->
<a href="file://END..."></a>
I should also handle:
<p>
<a href="file://START..."></a>
</p>
<!-- xhtml to be extracted -->
<p>
<a href="file://END..."></a>
</p>
Thank you very much, again.
Use this XPath 1.0 expression:
//a[starts-with(@href,'file://START')]/following-sibling::node()
[count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
=
count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
]
Or, use this XPath 2.0 expression:
//a[starts-with(@href,'file://START')]/following-sibling::node()
intersect
//a[starts-with(@href,'file://END')]/preceding-sibling::node()
The XPath 2.0 expression uses the XPath 2.0 intersect operator.
The XPath 1.0 expression uses the Kayessian formula (named after @Michael Kay) for the intersection of two node-sets:
$ns1[count(.|$ns2) = count($ns2)]
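To see why the formula works, here is a minimal sketch in Python rather than XPath. It models node-sets as plain collections of labels; the function name and the sample labels are made up for illustration. A node belongs to $ns2 exactly when adding it to $ns2 does not make the set bigger, which is what the count() comparison tests.

```python
def kayessian_intersection(ns1, ns2):
    """Mimic $ns1[count(.|$ns2) = count($ns2)] on plain Python collections."""
    ns2 = set(ns2)
    # Keep a node from ns1 only if its union with ns2 has the same size as ns2,
    # i.e. the node was already a member of ns2.
    return [n for n in ns1 if len({n} | ns2) == len(ns2)]


# Hypothetical labels: siblings following START, and siblings preceding END.
following_start = ["p1", "p2", "p3", "a_end", "tail"]
preceding_end = ["head", "a_start", "p1", "p2", "p3"]

print(kayessian_intersection(following_start, preceding_end))  # ['p1', 'p2', 'p3']
```

Only the nodes that are both after START and before END survive, which is the "between" selection the question asks for.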
Verification with XSLT:
This XSLT 1.0 transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
" //a[starts-with(@href,'file://START')]/following-sibling::node()
[count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
=
count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
]
"/>
</xsl:template>
</xsl:stylesheet>
when applied to the provided XML document:
<html>...
<a href="file://START0"></a>
<p>First nodes</p>
<p>First nodes
<span>X</span>
</p>
<p>First nodes</p>
<a href="file://END0"></a>...
</html>
produces the wanted, correct result:
<p>First nodes</p>
<p>First nodes
<span>X</span>
</p>
<p>First nodes</p>
This XSLT 2.0 transformation:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
" //a[starts-with(@href,'file://START')]/following-sibling::node()
intersect
//a[starts-with(@href,'file://END')]/preceding-sibling::node()
"/>
</xsl:template>
</xsl:stylesheet>
when applied on the same XML document (above) again produces exactly the wanted result.
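As a cross-check without an XPath engine, the same selection can be sketched with Python's standard xml.etree.ElementTree, which does not support following-sibling or count(), so the sketch walks the child list instead. This is a different technique from the XPath above, it assumes well-formed input, and the helper names are my own. Note that ElementTree has no separate text nodes, so only sibling elements are returned, whereas node() also matches whitespace text.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
<a href="file://START0"></a>
<p>First nodes</p>
<p>First nodes
<span>X</span>
</p>
<p>First nodes</p>
<a href="file://END0"></a>
</html>"""


def is_anchor(el, prefix):
    """True if el is an <a> whose href starts with prefix."""
    return el.tag == "a" and el.get("href", "").startswith(prefix)


def elements_between(root, start_prefix, end_prefix):
    """Return the sibling elements strictly between the START and END anchors."""
    for parent in root.iter():
        children = list(parent)
        starts = [i for i, c in enumerate(children) if is_anchor(c, start_prefix)]
        if not starts:
            continue
        # Only accept an END anchor that comes after the first START anchor.
        ends = [i for i, c in enumerate(children)
                if i > starts[0] and is_anchor(c, end_prefix)]
        if ends:
            return children[starts[0] + 1:ends[0]]
    return []


root = ET.fromstring(DOC)
selected = elements_between(root, "file://START", "file://END")
print([el.tag for el in selected])  # ['p', 'p', 'p']
```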
I've added a special case that I should handle
To handle this special case you can work in the same way, that is, use the Kayessian formula (and use XPath Visualizer as well ;-)). The intersecting node-sets change as follows:
Intersecting node-set C
"//p[.//a[starts-with(@href,'file://START')]]
/following-sibling::node()"
All following siblings of the p containing the START a.
Intersecting node-set D
"./following-sibling::p[.//a[starts-with(@href,'file://END')]]
/preceding-sibling::node()"
All preceding siblings of the p containing the END a, where that p is a following sibling of the current node.
Now you can perform the intersection of C and D with the Kayessian formula, that is:
"//p[.//a[starts-with(@href,'file://START')]]
/following-sibling::node()[
count(.| ./following-sibling::p
[.//a[starts-with(@href,'file://END')]]
/preceding-sibling::node())
=
count(./following-sibling::p
[.//a[starts-with(@href,'file://END')]]
/preceding-sibling::node())
]"
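Since the question ultimately wants the markers and everything between them removed, here is a rough Python sketch of that step. It accepts a marker that is either the bare a or a p wrapping it. It uses xml.etree.ElementTree instead of HTML Agility Pack and XPath, assumes well-formed input, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def marks(el, prefix):
    """True if el is an <a>, or a <p> containing an <a>, whose href starts with prefix."""
    if el.tag not in ("a", "p"):
        return False
    # Element.iter("a") yields el itself when it is an <a>, then descendant <a>s.
    return any(a.get("href", "").startswith(prefix) for a in el.iter("a"))


def remove_span(root, start_prefix, end_prefix):
    """Remove the START marker, the END marker and all sibling elements between them."""
    for parent in root.iter():
        children = list(parent)
        start = next((i for i, c in enumerate(children)
                      if marks(c, start_prefix)), None)
        if start is None:
            continue
        end = next((i for i, c in enumerate(children)
                    if i > start and marks(c, end_prefix)), None)
        if end is None:
            continue
        # Removing an element in ElementTree also drops its tail text.
        for child in children[start:end + 1]:
            parent.remove(child)
        return True
    return False


wrapped = ET.fromstring(
    '<html><h1>keep</h1><p><a href="file://START0"></a></p>'
    '<p>First nodes</p><p>First nodes<span>X</span></p>'
    '<p><a href="file://END0"></a></p><h2>keep</h2></html>'
)
remove_span(wrapped, "file://START", "file://END")
print([el.tag for el in wrapped])  # ['h1', 'h2']
```

Restricting the wrapper to p mirrors the //p[.//a[...]] step of the XPath; a different wrapper element would need the tuple in marks adjusted.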
If you need to manage both situations, you can take the union of the two intersections (the one for bare anchors and the one for anchors wrapped in p), where the XPath union operator | must be used.