Given this XML/HTML:
<dl>
<dt>Label1</dt><dd>Value1</dd>
<dt>Label2</dt><dd>Value2</dd>
<dt>Label3</dt><dd>Value3a</dd><dd>Value3b</dd>
<dt>Label4</dt><dd>Value4</dd>
</dl>
I want to find all <dt> and then, for each, find the following <dd> up until the next <dt>.
Using Ruby's Nokogiri I am able to accomplish this like so:
dl.xpath('dt').each do |dt|
  ct  = dt.xpath('count(following-sibling::dt)')
  dds = dt.xpath("following-sibling::dd[count(following-sibling::dt)=#{ct}]")
  puts "#{dt.text}: #{dds.map(&:text).join(', ')}"
end
#=> Label1: Value1
#=> Label2: Value2
#=> Label3: Value3a, Value3b
#=> Label4: Value4
However, as you can see I'm creating a variable in Ruby and then composing an XPath using it. How can I write a single XPath expression that does the equivalent?
I guessed at:
following-sibling::dd[count(following-sibling::dt)=count(self/following-sibling::dt)]
but apparently I don't understand what self means there.
This question is similar to "XPath : select all following siblings until another sibling", except there is no unique identifier for the 'stop' node.
This question is almost the same as "xpath to find all following sibling adjacent nodes up til another type", except that I'm asking for an XPath-only solution.
This is an interesting question. Most of the problems were already mentioned in @lwburk's answer and in its comments. To lay out the complexity hidden in this question for a casual reader, my answer is probably more elaborate and more verbose than the OP needed.
In XPath each step, and each node in the set of selected nodes, work independently. This means that
(Well, in fact I'm not 100% sure if that list is absolutely correct in every case. If anyone has better knowledge of the quirks of XPath, please comment or correct this answer by editing it.)
Despite the lack of generic solutions, some of these restrictions can be overcome if there is proper knowledge of the document structure, and/or the axis used previously can be "reverted" with another axis that serves as a backlink, i.e. one that matches only nodes that were used as the context node in the previous expression. A common example of this is when a parent axis is used after first using a child axis (the opposite case, from child to parent, is not uniquely revertible without additional information). In such cases, the information from previous steps is, more precisely, recreated at a later step (instead of accessing previously known information).
Unfortunately, in this case I couldn't come up with any other way to refer to previously known nodes except using XPath variables (which need to be defined beforehand).
XPath specifies a syntax for referring to a variable, but it does not specify a syntax for defining variables; how variables are defined depends on the environment where XPath is used. In fact, since the recommendation states that "The variable bindings used to evaluate a subexpression are always the same as those used to evaluate the containing expression", you could also claim that XPath explicitly forbids defining variables inside an XPath expression.
In your question the problem would be, when given a <dt>, to identify the following <dd> elements or the initially given node after the context node has been switched. Identifying the originally given <dt> is crucial, since for each node in the node-set to be filtered the predicate expression is evaluated with that node as the context node; so one cannot refer to the original <dt> in a predicate if there is no way to identify it after the context has changed. The same applies to <dd> elements that are following siblings of the given <dt>.
If you are using variables, one could debate whether there is a major difference between 1) using XPath variable syntax and a Nokogiri-specific way to declare that variable or 2) using Nokogiri extended XPath syntax that allows you to use Ruby variables in an XPath expression. In both cases the variable is defined in an environment-specific way, and the meaning of the XPath is clear only if the definition of the variable is also available. A similar case can be seen with XSLT, where in some cases you can choose between 1) defining a variable with <xsl:variable> prior to using your XPath expression or 2) using current() (inside your XPath expression), which is an XSLT extension.
You can select all the <dd> elements following the current <dt> element with following-sibling::dd (set A). You can also select all the <dd> elements following the next <dt> element with following-sibling::dt[1]/following-sibling::dd (set B). The set difference A\B then leaves the <dd> elements you actually wanted (elements that are in set A but not in set B). If variable $setA contains node-set A and variable $setB contains node-set B, the set difference can be obtained with (a modification of) the Kaysian technique:
dds = $setA[count(.|$setB) != count($setB)]
Currently your method is to select all the <dt> elements and then try to couple the value of each such element with the values of the corresponding <dd> elements in a single operation. Would it be possible to convert that coupling logic to work the other way round? You would first select all <dd> elements and then, for each <dd>, find the corresponding <dt>. This would mean that you end up accessing the same <dt> elements several times, and with every operation you add only one new <dd> value. This could affect performance, and the Ruby code could be more complicated.
The good side is the simplicity of the required XPath. When given a <dd> element, finding the corresponding <dt> is amazingly simple: preceding-sibling::dt[1]
As applied to your current Ruby code:
dl.xpath('dd').each do |dd|
  dt = dd.xpath("preceding-sibling::dt[1]")
  ## Insert new Ruby magic here ##
end
One possible solution:
dl.xpath('dt').each_with_index do |dt, i|
  dds = dt.xpath("following-sibling::dd[not(../dt[#{i + 2}]) or " +
                 "following-sibling::dt[1]=../dt[#{i + 2}]]")
  puts "#{dt.text}: #{dds.map(&:text).join(', ')}"
end
This relies on a value comparison of dt elements and will fail when there are duplicates. The following (much more complicated) expression does not depend on unique dt values:
following-sibling::dd[not(../dt[$n]) or
(following-sibling::dt[1] and count(following-sibling::dt[1]|../dt[$n])=1)]
Note: Your use of self fails because you're not properly using it as an axis (self::). Also, self always contains just the context node, so it would refer to each dd inspected by the expression, not back to the original dt.