Recently I needed to evaluate an XQuery on a node of an HTML document. Basically, I needed to select all elements with an href attribute from the first child of the body element. Here is a small example to explain:
<html>
<body>
<a href="http://www.google.be"/>
</body>
</html>
The desired extraction result is in this case obviously:
<a href="http://www.google.be"/>
My first idea was to use //body/*[1]//*[@href]
because:
//body matches the body element, wherever it is
/*[1] matches the first child of the body element
//*[@href] matches all descendants or self of the current element
I figured that would work, but on the example provided, the XQuery gives no results.
However, I read up a bit and found the following (source: http://www.keller.com/xslt/8/):
Alternate notation for "//": descendant-or-self::node()
So I changed my XQuery to //body/*[1]/descendant-or-self::node()[@href]
and this time, the results were correct.
My question: what is the difference between // and descendant-or-self::node()? What I found here (What's the difference between //node and /descendant::node in xpath?) and here (http://www.w3.org/TR/xpath/#axes) says:
// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para.
Which leads me to conclude that // and /descendant-or-self::node() are not interchangeable (probably because of the terminating / at the end?), but then can someone tell me if there is a shorthand for /descendant-or-self::node()?
Your first XPath expression (//body/*[1]//*[@href]) actually represents what you described in natural language: //body/*[1] is the first child of the body element, and //*[@href] selects all elements below it that have an @href attribute. The first child itself is not tested, because // ends with an implicit child step.
In your example, there is no element below the anchor tag having such an attribute. For example, this query would match the anchor in the following document:
<html>
<body>
<p>
<a href="http://www.google.be"/>
</p>
</body>
</html>
The non-abbreviated version of this query is:
//body/*[1]/descendant-or-self::node()/*[@href]
Putting your second query in contrast, the problem should be easy to see:
//body/*[1]/descendant-or-self::node()[@href]
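To make the contrast concrete, here is a minimal sketch in Python that emulates the two expressions by hand with the standard library's xml.etree.ElementTree. It is not a real XPath engine: the helper names (first_child_of_body, with_double_slash, with_descendant_or_self) are hypothetical, only element nodes are considered, and the result order is not guaranteed to be document order.

```python
import xml.etree.ElementTree as ET

FLAT = '<html><body><a href="http://www.google.be"/></body></html>'
NESTED = '<html><body><p><a href="http://www.google.be"/></p></body></html>'


def first_child_of_body(xml_text):
    """Rough stand-in for //body/*[1] (assumes a single body element)."""
    root = ET.fromstring(xml_text)
    body = root if root.tag == "body" else root.find(".//body")
    return list(body)[0]


def with_double_slash(ctx):
    """Emulates ctx//*[@href], i.e. ctx/descendant-or-self::node()/child::*[@href]."""
    # Element.iter() yields ctx itself and every descendant element;
    # the extra child step means ctx itself can never be returned.
    return [child for node in ctx.iter() for child in node if "href" in child.attrib]


def with_descendant_or_self(ctx):
    """Emulates ctx/descendant-or-self::node()[@href]."""
    # No child step here, so ctx itself is tested as well.
    return [node for node in ctx.iter() if "href" in node.attrib]


for name, doc in (("flat", FLAT), ("nested", NESTED)):
    ctx = first_child_of_body(doc)
    print(name,
          [e.tag for e in with_double_slash(ctx)],
          [e.tag for e in with_descendant_or_self(ctx)])
```

On the flat document the // variant finds nothing, while the descendant-or-self variant returns the anchor; on the nested document both return it.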
I think the problem is in your description: it does not appear to match your example!
Given the input:
<html>
<body>
<a href="http://www.google.be"/>
</body>
</html>
and the requirements statement:
"all elements with an href attribute from the first child of the body element"
Your XPath formulation of:
//body/*[1]//*[@href]
matches your requirements statement. But, the expected output would be an empty sequence, exactly as you have found... and NOT the output you suggested:
<a href="http://www.google.be"/>
To get the suggested output, your XPath requirements statement would instead perhaps be:
"the first child of the body element with an href attribute", which would lead to the XPath:
//*[@href][parent::body][1]
From your requirements statement and the mismatched example, it is hard to be sure exactly what you meant. So perhaps your requirements statement is:
"the first element in the body with a href attribute"
If that is the case, then I would suggest the XPath:
($input//*[@href][ancestor::body])[1]
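Note that the parentheses, i.e. the '(' and ')', flatten the descendant sequence(s) into a single sequence, so the [1] predicate applies to the whole result rather than once per parent. This lets you address each selected descendant in a manner similar to indexing an array.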
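The effect of the parentheses can be sketched by hand in Python with xml.etree.ElementTree. This is only an emulation under stated assumptions: the sample document and the helper names (first_per_parent, first_overall) are made up for illustration, and a single body element is assumed.

```python
import xml.etree.ElementTree as ET

DOC = """<html><body>
<div><a href="one"/><a href="two"/></div>
<div><a href="three"/></div>
</body></html>"""


def first_per_parent(body):
    """Emulates //*[@href][ancestor::body][1]: [1] is evaluated per parent."""
    result = []
    for parent in body.iter():
        matches = [child for child in parent if "href" in child.attrib]
        if matches:
            result.append(matches[0])
    return result


def first_overall(body):
    """Emulates (//*[@href][ancestor::body])[1]: [1] picks from the whole sequence."""
    matches = [e for e in body.iter() if e is not body and "href" in e.attrib]
    return matches[:1]


body = ET.fromstring(DOC).find(".//body")
print([e.get("href") for e in first_per_parent(body)])
print([e.get("href") for e in first_overall(body)])
```

Without the parentheses, one element per parent comes back ("one" and "three" here); with them, only the very first match ("one") is returned.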