
XPath to find all following siblings up until the next sibling of a particular type

Given this XML/HTML:

<dl>
  <dt>Label1</dt><dd>Value1</dd>
  <dt>Label2</dt><dd>Value2</dd>
  <dt>Label3</dt><dd>Value3a</dd><dd>Value3b</dd>
  <dt>Label4</dt><dd>Value4</dd>
</dl>

I want to find all <dt> and then, for each, find the following <dd> up until the next <dt>.

Using Ruby's Nokogiri I am able to accomplish this like so:

dl.xpath('dt').each do |dt|
  ct  = dt.xpath('count(following-sibling::dt)')
  dds = dt.xpath("following-sibling::dd[count(following-sibling::dt)=#{ct}]")
  puts "#{dt.text}: #{dds.map(&:text).join(', ')}"
end
#=> Label1: Value1
#=> Label2: Value2
#=> Label3: Value3a, Value3b
#=> Label4: Value4

However, as you can see I'm creating a variable in Ruby and then composing an XPath using it. How can I write a single XPath expression that does the equivalent?

I guessed at:

following-sibling::dd[count(following-sibling::dt)=count(self/following-sibling::dt)]

but apparently I don't understand what self means there.

This question is similar to "XPath : select all following siblings until another sibling", except that there is no unique identifier for the 'stop' node.

This question is almost the same as "xpath to find all following sibling adjacent nodes up til another type", except that I'm asking for an XPath-only solution.

asked Jan 19 '23 01:01 by Phrogz


2 Answers

This is an interesting question. Most of the problems were already mentioned in @lwburk's answer and its comments. To expose a bit more of the complexity hidden in this question for the casual reader, my answer is probably more elaborate and more verbose than the OP needed.

Features of XPath 1.0 related to this problem

In XPath each step, and each node in the set of selected nodes, works independently. This means that

  1. a subexpression has no generic way to access data that was computed in a previous subexpression, or to share data computed in this subexpression with other subexpressions;
  2. a node has no generic way to refer to a node that was used as a context node in a previous subexpression;
  3. a node has no generic way to refer to other nodes that are currently selected;
  4. if every one of the selected nodes must be compared to the same particular node, then that node must be uniquely definable in a way that is common to all selected nodes.

(Well, in fact I'm not 100% sure if that list is absolutely correct in every case. If anyone has better knowledge of the quirks of XPath, please comment or correct this answer by editing it.)

Despite the lack of generic solutions, some of these restrictions can be overcome if the document structure is known well enough, and/or if the axis used previously can be "reverted" with another axis that serves as a backlink, i.e. one that matches only the nodes that were used as context nodes in the previous expression. A common example is using the parent axis after first using the child axis (the opposite case, stepping up with parent and then back down with child, is not uniquely revertible without additional information). In such cases the information from previous steps is, more precisely, recreated at a later step rather than accessed as previously known information.

Unfortunately, in this case I couldn't come up with any way to refer to previously known nodes other than using XPath variables (which need to be defined beforehand).

XPath specifies a syntax for referring to a variable, but it does not specify a syntax for defining one; how variables are defined depends on the environment where XPath is used. In fact, since the recommendation states that "The variable bindings used to evaluate a subexpression are always the same as those used to evaluate the containing expression", you could also claim that XPath explicitly forbids defining variables inside an XPath expression.

Problem reformulated

In your question the problem is, given a <dt>, how to identify either the initially given node or its following <dd> elements once the context node has changed. Identifying the originally given <dt> is crucial because, for each node in the node-set to be filtered, the predicate expression is evaluated with that node as the context node; so one cannot refer to the original <dt> in a predicate if there is no way to identify it after the context has changed. The same applies to the <dd> elements that are following siblings of the given <dt>.

If you are using variables, one could debate whether there is a major difference between 1) using XPath variable syntax and a Nokogiri-specific way to declare that variable, or 2) using Ruby string interpolation to splice Ruby variables into an XPath expression. In both cases the variable is defined in an environment-specific way, and the meaning of the XPath is clear only if the definition of the variable is also available. A similar case can be seen in XSLT, where you can sometimes choose between 1) defining a variable with <xsl:variable> prior to using your XPath expression, or 2) using current() (inside your XPath expression), which is an XSLT extension.

Solution using nodeset variables and Kaysian method

You can select all the <dd> elements following the current <dt> element with following-sibling::dd (set A). You can also select all the <dd> elements following the next <dt> element with following-sibling::dt[1]/following-sibling::dd (set B). The set difference A\B then leaves the <dd> elements you actually wanted (elements that are in set A but not in set B). If variable $setA contains node-set A and variable $setB contains node-set B, the set difference can be obtained with a modification of the Kaysian technique:

dds = $setA[count(.|$setB) != count($setB)]

A simple workaround without any variables

Currently your method is to select all the <dt> elements and then, in a single operation, pair the value of each one with the values of its corresponding <dd> elements. Would it be possible to make that pairing logic work the other way round? You would first select all <dd> elements and then, for each <dd>, find the corresponding <dt>. This means you end up accessing the same <dt> elements several times, and every operation adds only one new <dd> value. This could affect performance, and the Ruby code could be more complicated.

The good side is the simplicity of the required XPath. When given a <dd> element, finding the corresponding <dt> is amazingly simple: preceding-sibling::dt[1]

As applied to your current Ruby code:

dl.xpath('dd').each do |dd|
  dt = dd.xpath("preceding-sibling::dt[1]")
  ## Insert new Ruby magic here ##
end

answered Feb 12 '23 00:02 by jasso


One possible solution:

dl.xpath('dt').each_with_index do |dt, i|
  dds = dt.xpath("following-sibling::dd[not(../dt[#{i + 2}]) or " +
                     "following-sibling::dt[1]=../dt[#{i + 2}]]")
  puts "#{dt.text}: #{dds.map(&:text).join(', ')}"
end

This relies on a value comparison of dt elements and will fail when there are duplicates. The following (much more complicated) expression does not depend on unique dt values:

following-sibling::dd[not(../dt[$n]) or 
    (following-sibling::dt[1] and count(following-sibling::dt[1]|../dt[$n])=1)]

Note: Your use of self fails because you're not using it as an axis (self::); without the double colon, self is read as a name test for a child element named self. Also, self always contains just the context node, so inside the predicate it would refer to each dd inspected by the expression, not back to the original dt.

answered Feb 11 '23 23:02 by Wayne