I'm learning XPath for web scraping and stumbled across these two XPath examples:
//div[@class="head"][@id="top"]
and
//div[@class='canvas- graph']//a[@href='/accounting.html'][i[@class='icon-usd']]/following-sibling::h4
I wonder what div[@class="head"][@id="top"] means. Does it mean that the @id="top" attribute belongs to the div element? Is it the same as //div[@class="head" and @id="top"]?
And what does it mean when square brackets are nested inside one another, as in the second example?
What would the HTML DOM look like for the second xpath expression to match it?
Square brackets delimit predicates†, and predicates filter items††.
You anticipate two ways in which predicates can be combined:
Consecutively: Yes, for non-positional conditions such as these, this is equivalent to logically and-ing the predicates. So, correct, //div[@class="head"][@id="top"] is equivalent to //div[@class="head" and @id="top"].
Recursively: Yes, XPath allows predicates within predicates (nesting, as you observe).
So, a[@href='/accounting.html'][i[@class='icon-usd']] selects those a elements with an @href attribute value equal to '/accounting.html' and a child i element with a @class attribute value equal to 'icon-usd'.
Together these composition mechanisms provide a powerful means of building predicates out of more basic conditions.
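To make both forms concrete, and to show one DOM the second expression would match, here is a minimal sketch. It assumes the third-party lxml library is installed, and the markup is invented purely for illustration (only the attribute values come from the question).

```python
from lxml import etree

# Hypothetical markup, made up for this example. The class value
# 'canvas- graph' is copied verbatim from the question's expression.
DOC = """
<html><body>
  <div class="head" id="top">header</div>
  <div class="head" id="other">same class, different id</div>
  <div class="canvas- graph">
    <div>
      <a href="/accounting.html"><i class="icon-usd"></i>Accounting</a>
      <h4>Accounting overview</h4>
    </div>
    <div>
      <a href="/accounting.html">Accounting (no icon child)</a>
      <h4>Not selected</h4>
    </div>
  </div>
</body></html>
"""

root = etree.fromstring(DOC.strip())

# Consecutive predicates vs. a single predicate joined with 'and'.
chained = root.xpath('//div[@class="head"][@id="top"]')
combined = root.xpath('//div[@class="head" and @id="top"]')
print([d.get("id") for d in chained])   # ['top']
print([d.get("id") for d in combined])  # ['top']

# Nested predicate: keep only the <a> that has an <i class="icon-usd"> child,
# then step to the <h4> siblings that follow it.
headings = root.xpath(
    "//div[@class='canvas- graph']//a[@href='/accounting.html']"
    "[i[@class='icon-usd']]/following-sibling::h4"
)
print([h.text for h in headings])       # ['Accounting overview']
```

The second link has the right href but no i child, so the nested predicate rejects it and its h4 is not returned.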
†Predicate references: XPath 1.0. XPath 3.1.
††Node-sets in XPath 1.0; sequences in XPath 2.0+.
The square brackets enclose what is called a predicate. From the XPath 1.0 specification:
A predicate filters a node-set with respect to an axis to produce a new node-set. For each node in the node-set to be filtered, the PredicateExpr is evaluated with that node as the context node, with the number of nodes in the node-set as the context size, and with the proximity position of the node in the node-set with respect to the axis as the context position; if PredicateExpr evaluates to true for that node, the node is included in the new node-set; otherwise, it is not included.
A PredicateExpr is evaluated by evaluating the Expr and converting the result to a boolean. If the result is a number, the result will be converted to true if the number is equal to the context position and will be converted to false otherwise; if the result is not a number, then the result will be converted as if by a call to the boolean function. Thus a location path para[3] is equivalent to para[position()=3].
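The numeric rule in that passage can be checked with a short sketch. It assumes the third-party lxml library is installed; the doc and para elements are a made-up sample.

```python
from lxml import etree

# Hypothetical four-paragraph document, invented for this example.
doc = etree.fromstring(
    "<doc><para>one</para><para>two</para>"
    "<para>three</para><para>four</para></doc>"
)

# A numeric predicate is compared against the context position...
by_number = doc.xpath("para[3]")
# ...so it selects the same node as the explicit position() test.
by_position = doc.xpath("para[position()=3]")
# last() returns the context size, i.e. the number of nodes being filtered.
last_one = doc.xpath("para[last()]")

print([p.text for p in by_number])    # ['three']
print([p.text for p in by_position])  # ['three']
print([p.text for p in last_one])     # ['four']
```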
Inside the predicate you test whether a condition is true or false as a means of filtering the set of items selected to the left of the predicate. Think of it like a SQL WHERE clause.
You can put multiple test criteria within a single predicate, or you can use multiple predicates. There may be some advantage, for tuning or for clarity, in choosing multiple predicates over using and with multiple tests within a single predicate.
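One caveat when choosing between the two forms: they are interchangeable only while no test depends on position. Each predicate in a chain filters the result of the one before it, so a positional test counts within the already-filtered set. A small sketch, assuming the third-party lxml library is installed and using an invented list:

```python
from lxml import etree

# Hypothetical list, made up for this example.
doc = etree.fromstring(
    "<ul><li class='a'>first</li>"
    "<li class='b'>second</li>"
    "<li class='b'>third</li></ul>"
)

# Non-positional tests: chaining and 'and' select the same nodes.
chained_plain = doc.xpath("li[@class='b'][text()='third']")
combined_plain = doc.xpath("li[@class='b' and text()='third']")
print([li.text for li in chained_plain])   # ['third']
print([li.text for li in combined_plain])  # ['third']

# Positional test: [1] is applied after the class filter,
# so it means "the first of the class-b items".
chained_pos = doc.xpath("li[@class='b'][1]")
# Here position() is counted among all li children, and the first
# li has class 'a', so nothing satisfies both conditions.
combined_pos = doc.xpath("li[@class='b' and position()=1]")
print([li.text for li in chained_pos])     # ['second']
print([li.text for li in combined_pos])    # []
```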