
Does a double slash in an XPath predicate work the same as in the path itself?

Tags:

xpath

I played with different XPath queries in XPather (which works only with older Firefox versions) and noticed a difference between the results of the following queries.

This one shows some results

//div[descendant::table/descendant::td[4]] 

This one returns an empty list

//div[//table//td[4]]

Do they differ because of some rule, or is this just misbehavior of a particular XPath interpreter implementation? (XPather seems to use the Firefox engine; it is just an excellent, simple GUI for querying.)

Maksee asked Apr 07 '12 11:04


2 Answers

In XPath 1.0, // is an abbreviation for /descendant-or-self::node()/, so your first path is

/descendant-or-self::node()/div[descendant::table/descendant::td[4]]

while the second one is rather different:

/descendant-or-self::node()/div[/descendant-or-self::node()/table/descendant-or-self::node()/td[4]]

The major difference is that inside your first predicate you look down for descendants relative to the div element, while in the second predicate you look down for descendants from the root node / (also called the document node). You might want //div[.//table//td[4]] as the second path expression to come closer to the first one.

[edit] Here is a sample:

<html>
  <body>
    <div>
      <table>
        <tbody>
          <tr>
            <td>1</td>
          </tr>
          <tr>
            <td>2</td>
          </tr>
          <tr>
            <td>3</td>
          </tr>
          <tr>
            <td>4</td>
          </tr>
        </tbody>
      </table>
    </div>
  </body>
</html>

With that sample, the path //div[descendant::table/descendant::td[4]] selects the div element, as it has a table child that has a fourth td descendant.

However, //div[.//table//td[4]] expands to //div[./descendant-or-self::node()/table/descendant-or-self::node()/td[4]], which is short for //div[./descendant-or-self::node()/table/descendant-or-self::node()/child::td[4]], and the sample contains no element with a fourth td child.

I hope that explains the difference. If you use //div[.//table/descendant::td[4]], you should get the same result as with your original form.
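To make the two readings of td[4] concrete, here is a minimal Python sketch run against the sample document. Python's standard xml.etree.ElementTree supports only a subset of XPath (no descendant:: axis), so the two predicates are emulated by hand. The helper names nth_descendant and nth_child_anywhere are made up for this illustration and are not part of any library.

```python
import xml.etree.ElementTree as ET

# The sample document from the answer above.
SAMPLE = """
<html>
  <body>
    <div>
      <table>
        <tbody>
          <tr><td>1</td></tr>
          <tr><td>2</td></tr>
          <tr><td>3</td></tr>
          <tr><td>4</td></tr>
        </tbody>
      </table>
    </div>
  </body>
</html>
"""

def nth_descendant(elem, tag, n):
    """Emulate descendant::tag[n]: the n-th matching descendant in document order."""
    matches = [e for e in elem.iter(tag) if e is not elem]
    return matches[n - 1] if len(matches) >= n else None

def nth_child_anywhere(elem, tag, n):
    """Emulate .//tag[n]: elements that are the n-th `tag` child of a node at or below elem."""
    hits = []
    for parent in elem.iter():          # elem itself and all of its descendants
        children = parent.findall(tag)  # direct children only
        if len(children) >= n:
            hits.append(children[n - 1])
    return hits

root = ET.fromstring(SAMPLE.strip())
div = root.find("body/div")
tables = [t for t in div.iter("table") if t is not div]

# Predicate of //div[descendant::table/descendant::td[4]]
descendant_form = [nth_descendant(t, "td", 4) for t in tables]
descendant_form = [td for td in descendant_form if td is not None]

# Predicate of //div[.//table//td[4]]
child_form = [td for t in tables for td in nth_child_anywhere(t, "td", 4)]

print([td.text for td in descendant_form])  # ['4']: the fourth td below the table
print(child_form)                           # []: no node has four td children
```

A non-empty list means the predicate is true and the div is selected; an empty list means the div is filtered out.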

Martin Honnen answered Oct 05 '22 21:10


There's an important note in the W3C document on XPath 1.0 (W3C Recommendation 16 November 1999):

XML Path Language (XPath) Version 1.0
    2 Location Paths
        2.5 Abbreviated Syntax

NOTE: The location path //para[1] does not mean the same as the location path /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their parents.

A similar note appears in the document on XPath 3.1 (W3C Recommendation 21 March 2017):

XML Path Language (XPath) 3.1
    3 Expressions
        3.3 Path Expressions
            3.3.5 Abbreviated Syntax

NOTE: The path expression //para[1] does not mean the same as the path expression /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their respective parents.

That means the double slash inside a path is a shortcut for /descendant-or-self::node()/ and, as such, also a starting point for the next level of XML tree iteration: the step to the right of // is evaluated again for the context node and for each of its descendants, and a positional predicate such as [1] counts within each of those evaluations.
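The difference described in the two notes can be tried out with a short Python sketch. The document below is made up for this illustration. ElementTree's limited XPath subset accepts .//para[1] (relative to the root element), while /descendant::para[1] is not supported there and is emulated with iter().

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this illustration.
doc = ET.fromstring(
    "<doc>"
    "<section><para>a</para><para>b</para></section>"
    "<section><para>c</para></section>"
    "</doc>"
)

# //para[1]: every para that is the first para child of its parent.
first_of_each_parent = [p.text for p in doc.findall(".//para[1]")]

# /descendant::para[1]: the single first para in document order.
# ElementTree has no descendant:: axis, so it is emulated with iter().
first_in_document = next(doc.iter("para")).text

print(first_of_each_parent)  # ['a', 'c']
print(first_in_document)     # a
```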

So the exact meaning of the predicate in this path

//div[ descendant::table/descendant::td[4] ]

is:

  • build a sequence of all <table> nodes descendant to the current <div>,
  • for every such <table> take its descendant <td> elements in document order and keep the fourth one, if there is one (the [4] applies to the descendant::td step, so it counts separately for each <table>),
  • combine the cells kept this way into a single sequence.

The predicate is true when that sequence is not empty. So the path returns all <div> elements in the document that contain at least one table with four or more data cells (including cells in nested tables, of course). And since there are tables in the document which have 4 cells or more, the whole expression selects their respective <div> ancestors.

On the other hand the predicate in

//div[ //table//td[4] ]

means:

  • scan the whole document tree for <table> elements (more precisely, test whether the root node or any of its descendants has a <table> child),
  • for every table found, scan its subtree for elements that are the fourth <td> child of their parent (i.e. test whether the table or any of its descendants has at least four <td> children).

Please note the predicate subexpression does not depend on the context node. It is a global path, resolving to some sequence of nodes (possibly empty), thus the predicate boolean value depends only on the document's structure. If it is true the whole path returns a sequence of all <div> elements in the document, else the empty sequence.

So the predicate is true if and only if some element inside some table has at least four <td> children.
As far as I can see, all <tr> rows contain two or three cells. There is no element with four or more <td> children, so the predicate subexpression returns an empty sequence, the predicate is false, and every <div> is filtered out. The result is nothing (an empty sequence).
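The all-or-nothing behaviour of the second query can be sketched in Python as well. The page below is made up to match the description (rows of two or three cells); the original page is not available here. Both queries are emulated by hand because ElementTree's XPath subset supports neither the descendant:: axis nor absolute paths inside predicates, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, made up to match the description: rows of two or three cells.
PAGE = """
<body>
  <div id="with-table">
    <table>
      <tr><td>a</td><td>b</td></tr>
      <tr><td>c</td><td>d</td><td>e</td></tr>
    </table>
  </div>
  <div id="no-table"><p>text</p></div>
</body>
"""

def has_fourth_descendant_td(table):
    """descendant::td[4] relative to one table: at least four td cells below it."""
    return len(list(table.iter("td"))) >= 4

def has_fourth_child_td(table):
    """//td[4] below one table: some node at or below it has at least four td children."""
    return any(len(node.findall("td")) >= 4 for node in table.iter())

def relative_form(root):
    """Emulates //div[descendant::table/descendant::td[4]]."""
    return [d.get("id") for d in root.iter("div")
            if any(has_fourth_descendant_td(t) for t in d.iter("table"))]

def absolute_form(root):
    """Emulates //div[//table//td[4]]: the predicate is computed once per document."""
    predicate = any(has_fourth_child_td(t) for t in root.iter("table"))
    return [d.get("id") for d in root.iter("div")] if predicate else []

root = ET.fromstring(PAGE.strip())
print(relative_form(root))  # ['with-table']
print(absolute_form(root))  # []

# Add one row with four cells: the global predicate becomes true for every div.
row = ET.SubElement(root.find("div/table"), "tr")
for text in "wxyz":
    ET.SubElement(row, "td").text = text
print(absolute_form(root))  # ['with-table', 'no-table']
```

Once a single row with four cells exists anywhere in the document, the second query selects every div, including the one that contains no table at all.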

CiaPan answered Oct 05 '22 22:10