
Sort elements of arbitrary XML document recursively

Tags: xml, groovy, xslt

I'm trying to sort and canonicalize some XML documents. The desired end result is that:

  1. every element's children are in alphabetical order
  2. every element's attributes are in alphabetical order
  3. comments are removed
  4. all elements are properly spaced (i.e. "pretty print").

I have achieved all of these goals except #1.

I have been using this answer as my template. Here is what I have so far:

import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource
import javax.xml.transform.TransformerFactory
import org.apache.xml.security.c14n.Canonicalizer

// Initialize the security library
org.apache.xml.security.Init.init()

// Create some variables

// Get arguments

// Make sure required arguments have been provided

if(!error) {
    // Create some variables
    def ext = fileInName.tokenize('.').last()
    fileOutName = fileOutName ?: "${fileInName.lastIndexOf('.').with {it != -1 ? fileInName[0..<it] : fileInName}}_CANONICALIZED_AND_SORTED.${ext}"
    def fileIn = new File(fileInName)
    def fileOut = new File(fileOutName)
    def xsltFile = new File(xsltName)
    def temp1 = new File("./temp1")
    def temp2 = new File("./temp2")
    def os
    def is

    // Sort the XML attributes, remove comments, and remove extra whitespace
    println "Canonicalizing..."
    Canonicalizer c = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS)
    os = temp1.newOutputStream()
    c.setWriter(os)
    c.canonicalize(fileIn.getBytes())
    os.close()

    // Sort the XML elements
    println "Sorting..."
    def factory = TransformerFactory.newInstance()
    is = xsltFile.newInputStream()
    def transformer = factory.newTransformer(new StreamSource(is))
    is.close()
    is = temp1.newInputStream()
    os = temp2.newOutputStream()
    transformer.transform(new StreamSource(is), new StreamResult(os))
    is.close()
    os.close()

    // Write the XML output in "pretty print"
    println "Beautifying..."
    def parser = new XmlParser()
    def printer = new XmlNodePrinter(new IndentPrinter(fileOut.newPrintWriter(), "    ", true))
    printer.print parser.parseText(temp2.getText())

    // Cleanup
    temp1.delete()
    temp2.delete()

    println "Done!"
}

Full script is here.

XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="foo">
    <foo>
      <xsl:apply-templates>
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
    </foo>
  </xsl:template>
</xsl:stylesheet>

Sample Input XML:

<foo b="b" a="a" c="c">
    <qwer>
    <zxcv c="c" b="b"/>
    <vcxz c="c" b="b"/>
    </qwer>
    <baz e="e" d="d"/>
    <bar>
    <fdsa g="g" f="f"/>
    <asdf g="g" f="f"/>
    </bar>
</foo>

Desired Output XML:

<foo a="a" b="b" c="c">
    <bar>
        <asdf f="f" g="g"/>
        <fdsa f="f" g="g"/>
    </bar>
    <baz d="d" e="e"/>
    <qwer>
        <vcxz b="b" c="c"/>
        <zxcv b="b" c="c"/>
    </qwer>
</foo>

How can I make the transform apply to all elements so all of an element's children will be in alphabetical order?

ubiquibacon asked Sep 09 '13 21:09


1 Answer

If you want the transform to apply to all elements, you need a template that matches all elements, as opposed to a template that matches only the specific "foo" element:

<xsl:template match="*">

Note that you would also have to change the current template that matches "node()" so that it excludes elements:

 <xsl:template match="node()[not(self::*)]|@*">

Within the new template that matches "*", you will also need code to select the attributes, because your "foo" template at the moment ignores them (<xsl:apply-templates /> does not select attributes).
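
To see the two gaps described above, here is a minimal Java sketch that runs the question's "foo"-only stylesheet in memory through javax.xml.transform. The class name FooOnlyDemo, the transform helper, and the trimmed-down input are illustrative assumptions, not part of the original script.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FooOnlyDemo {
    // The question's stylesheet (output formatting omitted): only <foo> is sorted.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:strip-space elements='*'/>"
      + "<xsl:template match='node()|@*'>"
      + "<xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='foo'>"
      + "<foo><xsl:apply-templates><xsl:sort select='name()'/></xsl:apply-templates></foo>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical, shortened version of the sample input.
    static final String INPUT =
        "<foo b='b' a='a'><qwer><zxcv/><vcxz/></qwer><bar/></foo>";

    // Applies a stylesheet to an XML string and returns the serialized result.
    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XSLT, INPUT));
    }
}
```

In the result, the children of foo should come out sorted (bar before qwer), but foo loses its attributes and the children of qwer keep their original order, because only the "foo" template sorts and it never selects "@*".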

Actually, looking at your requirements, items 1 to 3 can all be done with a single XSLT. For example, to remove comments, you could simply exclude them from the template that currently matches node():

<xsl:template match="node()[not(self::comment())][not(self::*)]|@*">

Try the following XSLT, which should achieve points 1 to 3:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="node()[not(self::comment())][not(self::*)]|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="@*">
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
      <xsl:apply-templates>
        <xsl:sort select="name()"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
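
As a quick check, the stylesheet above can be run against the sample input without temp files. This is a minimal sketch that assumes the JDK's built-in XSLT processor; the class name RecursiveSortDemo and the transform helper are illustrative, and a comment has been added to the sample input to exercise point 3.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RecursiveSortDemo {
    // The answer's stylesheet (indentation left to a later pretty-print step).
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:strip-space elements='*'/>"
      + "<xsl:template match='node()[not(self::comment())][not(self::*)]|@*'>"
      + "<xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='*'>"
      + "<xsl:copy>"
      + "<xsl:apply-templates select='@*'><xsl:sort select='name()'/></xsl:apply-templates>"
      + "<xsl:apply-templates><xsl:sort select='name()'/></xsl:apply-templates>"
      + "</xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // The question's sample input, plus one comment that should be dropped.
    static final String INPUT =
        "<foo b='b' a='a' c='c'><!-- note -->"
      + "<qwer><zxcv c='c' b='b'/><vcxz c='c' b='b'/></qwer>"
      + "<baz e='e' d='d'/>"
      + "<bar><fdsa g='g' f='f'/><asdf g='g' f='f'/></bar>"
      + "</foo>";

    // Applies a stylesheet to an XML string and returns the serialized result.
    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XSLT, INPUT));
    }
}
```

Because the "*" template applies to every element, the sort recurses: bar, baz and qwer are ordered under foo, and asdf/fdsa and vcxz/zxcv are ordered under their own parents. The attribute order relies on the serializer writing attributes in the order they were created, which common processors do in practice.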

EDIT: The template <xsl:template match="node()[not(self::comment())][not(self::*)]|@*"> can be replaced with just <xsl:template match="processing-instruction()|@*">, which may be easier to read. This works because "node()" matches elements, text nodes, comments and processing instructions. In this XSLT, elements are picked up by the other template, text nodes are copied by the built-in template, and comments are meant to be ignored, which leaves just processing instructions.
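
The simplified match pattern can be tried the same way. The sketch below is an assumption-laden illustration (the class name SimplifiedMatchDemo, the transform helper, and the small input with a processing instruction are all made up for this example) showing that text is still copied, comments are still dropped, and processing instructions survive.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SimplifiedMatchDemo {
    // Same stylesheet as the answer, with the first template's pattern simplified.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:strip-space elements='*'/>"
      + "<xsl:template match='processing-instruction()|@*'>"
      + "<xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='*'>"
      + "<xsl:copy>"
      + "<xsl:apply-templates select='@*'><xsl:sort select='name()'/></xsl:apply-templates>"
      + "<xsl:apply-templates><xsl:sort select='name()'/></xsl:apply-templates>"
      + "</xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical input: one comment, one processing instruction, one text node.
    static final String INPUT =
        "<root><!-- gone --><?keep me?><b>text</b><a/></root>";

    // Applies a stylesheet to an XML string and returns the serialized result.
    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XSLT, INPUT));
    }
}
```

One side effect worth knowing: name() returns the target of a processing instruction, so processing instructions are sorted in among their sibling elements rather than kept in place.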

Tim C answered Sep 28 '22 04:09