
How can I get array of elements, including missing elements, using XPath in XSLT?

Given the following XML-compliant HTML:

<div>
 <a>a1</a>
 <b>b1</b>
</div>

<div>
 <b>b2</b>
</div>

<div>
 <a>a3</a>
 <b>b3</b>
 <c>c3</c>
</div>

Evaluating //a returns:

[a1,a3]

The problem is that the data from the third div now sits in second place: when an a element is not found, it is skipped entirely.

How can I write an XPath expression that selects all a elements and returns:

[a1, null, a3]

The same applies to //c; I wonder if it's possible to get:

[null, null, c3]

UPDATE: Consider another scenario where there are no common parent <div> elements:

<h1>heading1</h1>
 <a>a1</a>
 <b>b1</b>


<h1>heading2</h1>
 <b>b2</b>


<h1>heading3</h1>
 <a>a3</a>
 <b>b3</b>
 <c>c3</c>

UPDATE: I am now able to use XSLT as well.

KJW asked Mar 10 '12 17:03


3 Answers

There is no null value in XPath. There's a semi-related question here which also explains this: http://www.velocityreviews.com/forums/t686805-xpath-query-to-return-null-values.html

Realistically, you've got three options:

  1. Don't use XPath at all.
  2. Use this: //a | //div[not(a)], which returns the div element itself wherever it contains no a, and have your Java code treat any returned div as 'no a element present'. Depending on the context, this may even let you output something more useful, since you have access to the entire contents of the div, for example an error 'no a element found in div (some identifier)'.
  3. Preprocess your XML with an XSLT that inserts an a element with a suitable default value into any div element that does not already have one.
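Option 2 could be sketched in Java roughly as follows. This is only an illustration using the JDK's built-in javax.xml.xpath API; the class name UnionColumn, the method column and the wrapping <t> root element are assumptions made for the example, and it relies on the evaluator returning the node-set in document order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class UnionColumn {
    // Sample input from the question, wrapped in a hypothetical <t> root.
    static final String XML =
        "<t><div><a>a1</a><b>b1</b></div>"
      + "<div><b>b2</b></div>"
      + "<div><a>a3</a><b>b3</b><c>c3</c></div></t>";

    /** One entry per div: the text of its child called `name`, or null. */
    static List<String> column(String xml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // Select the wanted elements plus every div that lacks one.
        String expr = "//" + name + " | //div[not(" + name + ")]";
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            // A returned div is the placeholder for "no such child here".
            out.add("div".equals(n.getNodeName()) ? null : n.getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(column(XML, "a")); // prints [a1, null, a3]
        System.out.println(column(XML, "c")); // prints [null, null, c3]
    }
}
```

The placeholder div is still available in the loop, so the null branch is also the place to log which div was missing the element.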

Your second case is a little tricky, and to be honest, I'd actually recommend not using XPath for it at all, but it can be done:

//a | //h1[not(following-sibling::a) or generate-id(.) != generate-id(following-sibling::a[1]/preceding-sibling::h1[1])]

This will match any a elements, or any h1 elements where no following a element exists before the next h1 element, or the end of the document. As Dimitre pointed out though, this only works if you're using it from within XSLT, as generate-id is an XSLT function.

If you're not using it from within XSLT, you can use this rather contrived formula:

//a | //h1[not(following-sibling::a) or count(. | preceding-sibling::h1) != count(following-sibling::a[1]/preceding-sibling::h1)]

It works by matching h1 elements where the count of itself and all preceding h1 elements is not the same as the count of all h1 elements preceding the next a. There may be a more efficient way of doing it in XPath, but if it's going to get any more contrived than that, I'd definitely recommend not using XPath at all.
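As a rough check of the count-based expression outside XSLT, here is a small Java sketch. The class name HeadingColumn and the <t> root wrapper are made up for the example; any h1 that comes back is treated as the "no a in this group" marker.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HeadingColumn {
    // The second scenario from the question, wrapped in a hypothetical <t> root.
    static final String XML =
        "<t><h1>heading1</h1><a>a1</a><b>b1</b>"
      + "<h1>heading2</h1><b>b2</b>"
      + "<h1>heading3</h1><a>a3</a><b>b3</b><c>c3</c></t>";

    // Plain XPath 1.0, no generate-id(): all a elements, plus each h1
    // whose group has no a before the next h1 (or the end of the document).
    static final String EXPR =
        "//a | //h1[not(following-sibling::a) or "
      + "count(. | preceding-sibling::h1) != "
      + "count(following-sibling::a[1]/preceding-sibling::h1)]";

    static List<String> column(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(EXPR, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            // A selected h1 stands for a group without an a element.
            out.add("h1".equals(n.getNodeName()) ? null : n.getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(column(XML)); // prints [a1, null, a3]
    }
}
```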

Flynn1179 answered Sep 30 '22 15:09


Solution for the first problem:

This XPath expression:

    /*/div/a
|
    /*/div[not(a)]

When evaluated against the following XML document:

<t>
    <div>
        <a>a1</a>
        <b>b1</b>
    </div>
    <div>
        <b>b2</b>
    </div>
    <div>
        <a>a3</a>
        <b>b3</b>
        <c>c3</c>
    </div>
</t>

selects the following three nodes (a, div, a):

<a>a1</a>
<div>
    <b>b2</b>
</div>
<a>a3</a>

In your Java array, any selected non-a element should be treated as (or replaced by) null.


Here is one solution to the second problem:

Use these XPath expressions for selecting the a elements from each group:

For the first group:

/*/h1[1]
   /following-sibling::a
      [not(/*/h1[2])
     or
       count(.|/*/h1[2]/preceding-sibling::a)
      =
       count(/*/h1[2]/preceding-sibling::a)
      ]

For the second group:

/*/h1[2]
   /following-sibling::a
      [not(/*/h1[3])
     or
       count(.|/*/h1[3]/preceding-sibling::a)
      =
       count(/*/h1[3]/preceding-sibling::a)
      ]

And for the 3rd group:

/*/h1[3]
   /following-sibling::a
      [not(/*/h1[4])
     or
      count(.|/*/h1[4]/preceding-sibling::a)
      =
       count(/*/h1[4]/preceding-sibling::a)
      ]

If the value of:

count(/*/h1)

is $cnt,

generate $cnt such expressions (for i = 1 to $cnt) and evaluate all of them. The node-set selected by each expression either contains an a element or is empty. If the $k-th group (the nodes selected by evaluating the $k-th expression) contains an a, use its string value as the $k-th item of the wanted array; otherwise use null for the $k-th item.
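Generating and evaluating the $cnt expressions might look like this in Java. It is a sketch under assumptions: the class GroupExpressions and the helper groupExpr are hypothetical names, and the input is the question's fragment wrapped in a <t> root.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class GroupExpressions {
    static final String XML =
        "<t><h1>heading1</h1><a>a1</a><b>b1</b>"
      + "<h1>heading2</h1><b>b2</b>"
      + "<h1>heading3</h1><a>a3</a><b>b3</b><c>c3</c></t>";

    /** Builds the expression for the a elements of the i-th h1 group (1-based). */
    static String groupExpr(int i) {
        String next = "/*/h1[" + (i + 1) + "]";
        String beforeNext = next + "/preceding-sibling::a";
        return "/*/h1[" + i + "]/following-sibling::a"
             + "[not(" + next + ") or "
             + "count(. | " + beforeNext + ") = count(" + beforeNext + ")]";
    }

    static List<String> column(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xp = XPathFactory.newInstance().newXPath();
        // $cnt: the number of h1 groups in the document.
        int cnt = ((Double) xp.evaluate("count(/*/h1)", doc,
                                        XPathConstants.NUMBER)).intValue();
        List<String> out = new ArrayList<>();
        for (int i = 1; i <= cnt; i++) {
            NodeList group = (NodeList) xp.evaluate(groupExpr(i), doc,
                                                    XPathConstants.NODESET);
            // An empty group becomes null; otherwise take the first a's text.
            out.add(group.getLength() > 0 ? group.item(0).getTextContent() : null);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(column(XML)); // prints [a1, null, a3]
    }
}
```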

Here is an XSLT-based verification of the above XPath expressions:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
   <xsl:variable name="vGroup1" select=
   "/*/h1[1]
       /following-sibling::a
          [not(/*/h1[2])
         or
           count(.|/*/h1[2]/preceding-sibling::a)
          =
           count(/*/h1[2]/preceding-sibling::a)
          ]
   "/>

   <xsl:variable name="vGroup2" select=
   "/*/h1[2]
       /following-sibling::a
          [not(/*/h1[3])
         or
           count(.|/*/h1[3]/preceding-sibling::a)
          =
           count(/*/h1[3]/preceding-sibling::a)
          ]
   "/>

   <xsl:variable name="vGroup3" select=
   "/*/h1[3]
       /following-sibling::a
          [not(/*/h1[4])
         or
          count(.|/*/h1[4]/preceding-sibling::a)
          =
           count(/*/h1[4]/preceding-sibling::a)
          ]
   "/>

 Group1:  "<xsl:copy-of select="$vGroup1"/>"

 Group2:  "<xsl:copy-of select="$vGroup2"/>"

 Group3:  "<xsl:copy-of select="$vGroup3"/>"

 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document (the OP has not provided a complete, well-formed XML document):

<t>
    <h1>heading1</h1>
    <a>a1</a>
    <b>b1</b>

    <h1>heading2</h1>
    <b>b2</b>

    <h1>heading3</h1>
    <a>a3</a>
    <b>b3</b>
    <c>c3</c>
</t>

the three XPath expressions are evaluated and the nodes selected by each of them are output:

 Group1:  "<a>a1</a>"

 Group2:  ""

 Group3:  "<a>a3</a>"

Explanation:

We use the well-known Kayessian formula for the intersection of two nodesets:

$ns1[count(. | $ns2) = count($ns2)]

The result of evaluating this expression contains exactly the nodes that belong both to the nodeset $ns1 and the nodeset $ns2.
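To see the formula in isolation, here is a small Java sketch on a made-up document. The class Intersection, the helper intersect and the <t>/<i> sample are assumptions for illustration only. Because javax.xml.xpath needs a variable resolver for $ns1 and $ns2, the sketch substitutes the two expressions textually instead; $ns2 therefore has to be an absolute expression, since the context node changes inside the predicate.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class Intersection {
    // Hypothetical sample document, not taken from the question.
    static final String XML = "<t><i>1</i><i>2</i><i>3</i><i>4</i></t>";

    /** $ns1[count(. | $ns2) = count($ns2)], built by textual substitution. */
    static String intersect(String ns1, String ns2) {
        return "(" + ns1 + ")[count(. | " + ns2 + ") = count(" + ns2 + ")]";
    }

    static List<String> select(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String ns1 = "/t/i[position() <= 3]"; // the first three i elements
        String ns2 = "/t/i[position() >= 2]"; // all but the first i element
        // A node is kept only if adding it to $ns2 does not grow $ns2.
        System.out.println(select(XML, intersect(ns1, ns2))); // prints [2, 3]
    }
}
```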

What remains is to substitute $ns1 and $ns2 with expressions that are relevant to the problem.

We substitute $ns1 by:

/*/h1[1]
    /following-sibling::a

and we substitute $ns2 by:

/*/h1[2]
    /preceding-sibling::a

In other words, the a elements that are between the first and second /*/h1 are the intersection of the a elements that are following siblings of /*/h1[1] and the a elements that are preceding siblings of /*/h1[2].

This expression is only problematic for the a elements that follow the last of the /*/h1 elements. This is why we add a further condition that checks for the non-existence of a next /*/h1 element, and combine it with the intersection test using or.

Finally, as a guiding example for a Java implementation, here is a complete XSLT transformation that does something similar: it produces a serialized array, and can be mechanically translated to a corresponding Java solution:

<xsl:stylesheet version="1.0"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
         xmlns:my="my:my">
         <xsl:output method="text"/>

         <my:null>null</my:null>
         <my:Q>"</my:Q>

         <xsl:variable name="vNull" select="document('')/*/my:null"/>
         <xsl:variable name="vQ" select="document('')/*/my:Q"/>

         <xsl:template match="/">
           <xsl:variable name="vGroup1" select=
           "/*/h1[1]
               /following-sibling::a
                  [not(/*/h1[2])
                 or
                   count(.|/*/h1[2]/preceding-sibling::a)
                  =
                   count(/*/h1[2]/preceding-sibling::a)
                  ]
           "/>

           <xsl:variable name="vGroup2" select=
           "/*/h1[2]
               /following-sibling::a
                  [not(/*/h1[3])
                 or
                   count(.|/*/h1[3]/preceding-sibling::a)
                  =
                   count(/*/h1[3]/preceding-sibling::a)
                  ]
           "/>

           <xsl:variable name="vGroup3" select=
           "/*/h1[3]
               /following-sibling::a
                  [not(/*/h1[4])
                 or
                  count(.|/*/h1[4]/preceding-sibling::a)
                  =
                   count(/*/h1[4]/preceding-sibling::a)
                  ]
           "/>

         [<xsl:value-of select=
          "concat($vQ[$vGroup1/self::a[1]],
                  $vGroup1/self::a[1],
                  $vQ[$vGroup1/self::a[1]],
                  $vNull[not($vGroup1/self::a[1])])"/>
          <xsl:text>,</xsl:text>

         <xsl:value-of select=
          "concat($vQ[$vGroup2/self::a[1]],
                  $vGroup2/self::a[1],
                  $vQ[$vGroup2/self::a[1]],
                  $vNull[not($vGroup2/self::a[1])])"/>
          <xsl:text>,</xsl:text>

         <xsl:value-of select=
          "concat($vQ[$vGroup3/self::a[1]],
                  $vGroup3/self::a[1],
                  $vQ[$vGroup3/self::a[1]],
                  $vNull[not($vGroup3/self::a[1])])"/>]
         </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the same XML document (above), the wanted, correct result is produced:

     ["a1",null,"a3"]

Update2:

Now the OP has added that he can use an XSLT solution. Here is one:

<xsl:stylesheet version="1.0"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
         xmlns:my="my:my" exclude-result-prefixes="xsl">
         <xsl:output omit-xml-declaration="yes" indent="yes"/>

         <xsl:key name="kFollowing" match="a"
              use="generate-id(preceding-sibling::h1[1])"/>

         <my:null/>
         <xsl:variable name="vNull" select="document('')/*/my:null"/>

         <xsl:template match="/*">
           <xsl:copy-of select=
           "h1/following-sibling::a[1]
          |
            h1[not(key('kFollowing', generate-id()))]"/>

           =============================================

           <xsl:apply-templates select="h1"/>

         </xsl:template>

         <xsl:template match="h1">
           <xsl:variable name="vAsInGroup" select=
               "key('kFollowing', generate-id())"/>
           <xsl:copy-of select="$vAsInGroup[1] | $vNull[not($vAsInGroup)]"/>
         </xsl:template>
</xsl:stylesheet>

This transformation implements two different solutions. The difference is in what element is used to represent "null". In the first case it is the h1 element. This isn't recommended, because any h1 already has its own meaning which is different from "representing null". The second solution uses a special my:null element to represent null.

When this transformation is applied on the same XML document as above:

<t>
        <h1>heading1</h1>
        <a>a1</a>
        <b>b1</b>

        <h1>heading2</h1>
        <b>b2</b>

        <h1>heading3</h1>
        <a>a3</a>
        <b>b3</b>
        <c>c3</c>
</t>

each of the two XPath expressions (containing XSLT key() references) is evaluated and the selected nodes are output (above and below "========", respectively):

<a>a1</a>
<h1>heading2</h1>
<a>a3</a>

           =============================================

           <a>a1</a>
<my:null xmlns:my="my:my" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>
<a>a3</a>

Note on performance:

Because keys are used, this solution will be significantly more efficient when more than one search is made -- for example, when the corresponding arrays for a, b, and c need to be produced.
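The same idea (index the elements once by their nearest preceding h1, then look up as many columns as needed) can be sketched in plain Java without XSLT. This is not the keyed stylesheet itself, only an analogous single-pass DOM walk; the class GroupIndex and its methods are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class GroupIndex {
    static final String XML =
        "<t><h1>heading1</h1><a>a1</a><b>b1</b>"
      + "<h1>heading2</h1><b>b2</b>"
      + "<h1>heading3</h1><a>a3</a><b>b3</b><c>c3</c></t>";

    /** One map per h1 group: element name -> text of its first occurrence. */
    static List<Map<String, String>> index(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        List<Map<String, String>> groups = new ArrayList<>();
        for (Node n = doc.getDocumentElement().getFirstChild();
             n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) continue;
            if ("h1".equals(n.getNodeName())) {
                groups.add(new HashMap<>()); // a heading opens a new group
            } else if (!groups.isEmpty()) {
                // Keep only the first element of each name within the group.
                groups.get(groups.size() - 1)
                      .putIfAbsent(n.getNodeName(), n.getTextContent());
            }
        }
        return groups;
    }

    /** Reads one column out of the index; missing entries come back as null. */
    static List<String> column(List<Map<String, String>> groups, String name) {
        List<String> out = new ArrayList<>();
        for (Map<String, String> g : groups) out.add(g.get(name));
        return out;
    }

    public static void main(String[] args) throws Exception {
        List<Map<String, String>> groups = index(XML); // built once
        System.out.println(column(groups, "a")); // prints [a1, null, a3]
        System.out.println(column(groups, "b")); // prints [b1, b2, b3]
        System.out.println(column(groups, "c")); // prints [null, null, c3]
    }
}
```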

Dimitre Novatchev answered Sep 30 '22 13:09


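I suggest you use the following, which might be rewritten as an xsl:function where the parent node name (here: div) is parametrized. Note that it requires an XSLT 2.0 processor, since it uses distinct-values() and a name() path step.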

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<xsl:template match="/">
    <root>
        <aList><xsl:copy-of select="$divIncludingNulls//a"/></aList>
        <bList><xsl:copy-of select="$divIncludingNulls//b"/></bList>
        <cList><xsl:copy-of select="$divIncludingNulls//c"/></cList>
    </root>
</xsl:template>

<xsl:variable name="divChild" select="distinct-values(//div/*/name())"/>

<xsl:variable name="divIncludingNulls">
    <xsl:for-each select="//div">
        <xsl:variable name="divElt" select="."/>
        <div>
            <xsl:for-each select="$divChild">
                <xsl:variable name="divEltvalue" select="$divElt/*[name()=current()]"/>
                <xsl:element name="{.}">
                    <xsl:choose>
                        <xsl:when test="$divEltvalue"><xsl:value-of select="$divEltvalue"/></xsl:when>
                        <xsl:otherwise>null</xsl:otherwise>
                    </xsl:choose>
                </xsl:element>
            </xsl:for-each>
       </div>
    </xsl:for-each>
</xsl:variable>

</xsl:stylesheet>

Applied to

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <div>
     <a>a1</a>
     <b>b1</b>
    </div>

    <div>
     <b>b2</b>
    </div>

    <div>
     <a>a3</a>
     <b>b3</b>
     <c>c3</c>
    </div>
</root>

the output is

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <aList>
        <a>a1</a>
        <a>null</a>
        <a>a3</a>
    </aList>
    <bList>
        <b>b1</b>
        <b>b2</b>
        <b>b3</b>
    </bList>
    <cList>
        <c>null</c>
        <c>null</c>
        <c>c3</c>
    </cList>
</root>

Maestro13 answered Sep 30 '22 15:09