Preserve formatting when updating an XML file with Groovy

Question

I have a large number of XML files that contain URLs. I'm writing a Groovy utility to find each URL and replace it with an updated version.

Given example.xml:

<?xml version="1.0" encoding="UTF-8"?>
<page>
    <content>
        <section>
            <link>
                <url>/some/old/url</url>
            </link>
            <link>
                <url>/some/old/url</url>
            </link>
        </section>
        <section>
            <link>
                <url>
                    /a/different/old/url?with=specialChars&amp;escaped=true
                </url>
            </link>
        </section>
    </content>
</page>

Once the script has run, example.xml should contain:

<?xml version="1.0" encoding="UTF-8"?>
<page>
    <content>
        <section>
            <link>
                <url>/a/new/and/improved/url</url>
            </link>
            <link>
                <url>/a/new/and/improved/url</url>
            </link>
        </section>
        <section>
            <link>
                <url>
                    /a/different/new/and/improved/url?with=specialChars&amp;stillEscaped=true
                </url>
            </link>
        </section>
    </content>
</page>

This is easy to do using Groovy's excellent XML support, except that I want to change the URLs and nothing else about the file.

By that I mean:

- whitespace must not change (files might contain spaces, tabs, or both)
- comments must be preserved
- Windows vs. Unix-style line separators must be preserved
- the XML declaration at the top must not be added or removed
- attributes in tags must retain their order

So far, after trying many combinations of XmlParser, DOMBuilder, XmlNodePrinter, XmlUtil.serialize(), and so on, I've landed on reading each file line-by-line and applying an ugly hybrid of the xml utilities and regular expressions.

Reading and writing each file:

files.each { File file ->
    // Detect the file's line separator so it can be written back unchanged
    def lineEnding = file.text.contains('\r\n') ? '\r\n' : '\n'
    def newLineAtEof = file.text.endsWith(lineEnding)
    def lines = file.readLines()
    file.withWriter { w ->
        lines.eachWithIndex { line, index ->
            line = update(line)
            w.write(line)
            // Separator between lines, and after the last line only if the original had one
            if (index < lines.size() - 1) w.write(lineEnding)
            else if (newLineAtEof) w.write(lineEnding)
        }
    }
}

Searching for and updating URLs within a line:

def matcher = (line =~ urlTagRegexp) //matches a <url> element and its contents
matcher.each { groups ->
    def urlNode = new XmlParser().parseText(line)
    def url = urlNode.text()
    def newUrl = translate(url)
    if (newUrl) {
        urlNode.value = newUrl
        def replacement = nodeToString(urlNode)
        line = matcher.replaceAll(replacement)
    }
}

def nodeToString(node) {
    def writer = new StringWriter()
    writer.withPrintWriter { printWriter ->
        def printer = new XmlNodePrinter(printWriter)
        printer.preserveWhitespace = true
        printer.print(node)
    }
    writer.toString().replaceAll(/[\r\n]/, '')
}

This mostly works, except it can't handle a tag split over multiple lines, and messing with newlines when writing the files back out is cumbersome.

I'm new to Groovy, but I feel like there must be a groovier way of doing this.
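One way around the multi-line limitation is to run the regular expression over the whole file text instead of line by line. The following is a minimal sketch under stated assumptions, not a definitive implementation: updateUrls and the translate closure are hypothetical names, the <url> tags are assumed to carry no attributes, and URLs are translated in their raw (still escaped) form so that no XML parsing or re-serialization takes place.

```groovy
// Sketch only: updateUrls and translate are hypothetical names, not part of the original script.
String updateUrls(String xmlText, Closure translate) {
    // (?s) lets '.' match line breaks, so a <url> element spread over several lines still matches.
    // Group 1: opening tag plus leading whitespace, group 2: the URL, group 3: trailing whitespace plus closing tag.
    xmlText.replaceAll(/(?s)(<url>\s*)(.*?)(\s*<\/url>)/) { all, open, url, close ->
        def newUrl = translate(url)
        // Leave the match untouched when translate returns null or an empty string
        newUrl ? open + newUrl + close : all
    }
}

// Assumed mapping for illustration; it works on the raw text, so &amp; stays escaped
def translate = { String url -> url.replace('/some/old/url', '/a/new/and/improved/url') }

def sample = '<link>\r\n\t<url>\r\n\t\t/some/old/url?a=1&amp;b=2\r\n\t</url>\r\n</link>\r\n'
println updateUrls(sample, translate)
```

Because only the matched URL text is replaced, whitespace, comments, line separators, the XML declaration, and attribute order are carried over as they were. A file could then be rewritten with something like file.setText(updateUrls(file.getText('UTF-8'), translate), 'UTF-8'), assuming UTF-8 input. The regular expression does not understand XML, so CDATA sections or commented-out <url> elements would need extra care.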

akhikhl · Accepted Answer

I just created a gist at https://gist.github.com/akhikhl/8070808 to demonstrate how such a transformation can be done with Groovy and JDOM2.

Important notes:

- Groovy can use any Java library. If something cannot be done with the Groovy JDK, it can be done with another library.
- The jaxen library (which implements XPath) must be included explicitly (via @Grab or via Maven/Gradle), since it is an optional dependency of JDOM2.
- The sequence of @Grab/@GrabExclude instructions works around jaxen's quirky dependency on JDOM 1.0.
- XPathFactory.compile also supports variable binding and filters (see the online javadoc).
- XPathExpression (which is returned by compile) offers two main methods, evaluate and evaluateFirst. evaluate always returns a list of all XML nodes matching the expression, while evaluateFirst returns just the first matching node.
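The notes above can be sketched roughly as follows. This is not the code from the gist: the dependency versions, the sample XML, and the variable names are assumptions made for illustration, and the @GrabExclude line simply mirrors the workaround mentioned above.

```groovy
// Versions are assumptions; adjust them to whatever your build uses
@Grab('org.jdom:jdom2:2.0.6')
@Grab('jaxen:jaxen:1.1.6')
@GrabExclude('jdom:jdom')
import org.jdom2.Element
import org.jdom2.filter.Filters
import org.jdom2.input.SAXBuilder
import org.jdom2.xpath.XPathFactory

def xml = '''<page><content>
    <link><url>/some/old/url</url></link>
    <link><url>/other/url</url></link>
</content></page>'''

def doc = new SAXBuilder().build(new StringReader(xml))

// Passing a filter to compile() yields an expression typed to Element
def urlExpr = XPathFactory.instance().compile('//url', Filters.element())

// evaluate returns every matching node, evaluateFirst only the first one (or null)
List<Element> urls = urlExpr.evaluate(doc)
Element first = urlExpr.evaluateFirst(doc)

urls.each { Element url ->
    if (url.getText() == '/some/old/url') url.setText('/a/new/and/improved/url')
}
println urls*.getText()
```

JDOM2 hands back the unescaped text, so a URL containing &amp; in the file is seen as & here and is escaped again on output.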

Update

The following code:

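import org.jdom2.output.Format
import org.jdom2.output.LineSeparator
import org.jdom2.output.XMLOutputter

new XMLOutputter().with {
  format = Format.getRawFormat()
  format.setLineSeparator(LineSeparator.NONE)
  output(doc, System.out) // doc is the parsed org.jdom2.Document
}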

solves the problem of preserving whitespace and line separators. getRawFormat constructs a Format object that preserves whitespace. LineSeparator.NONE tells that Format object not to convert line separators.

The gist mentioned above contains this new code as well.
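A small round trip shows the effect of the raw format. This is a sketch under assumptions rather than the gist's code: the JDOM2 version and the sample XML are made up for illustration, and only the root element is written out so that the result can be compared with the input directly.

```groovy
// Version is an assumption; no jaxen is needed because no XPath is used here
@Grab('org.jdom:jdom2:2.0.6')
import org.jdom2.input.SAXBuilder
import org.jdom2.output.Format
import org.jdom2.output.LineSeparator
import org.jdom2.output.XMLOutputter

def xml = '<page>\n\t<!-- keep me -->\n    <url>\n        /some/old/url\n    </url>\n</page>'
def doc = new SAXBuilder().build(new StringReader(xml))

// Replace the URL inside the existing text so the surrounding indentation is kept
def url = doc.rootElement.getChild('url')
url.setText(url.getText().replace('/some/old/url', '/a/new/and/improved/url'))

// Raw format keeps text as parsed; NONE stops line breaks in text from being rewritten
def format = Format.getRawFormat()
format.setLineSeparator(LineSeparator.NONE)
def result = new XMLOutputter(format).outputString(doc.rootElement)
println result
```

Calling output(doc, ...) on the whole document also writes the XML declaration, so it is worth diffing a real file before and after to confirm that nothing but the URLs changed.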

banterCZ · Answer

There is also a solution that does not need any third-party library.

def xml = file.text
def document = groovy.xml.DOMBuilder.parse(new StringReader(xml))
def root = document.documentElement
use(groovy.xml.dom.DOMCategory) {
    // manipulate the XML here, e.g. root.someElement?.each { it.value = 'new value' }
}

def result = groovy.xml.dom.DOMUtil.serialize(root)

file.withWriter { w ->
    w.write(result)
}

Taken from http://jonathan-whywecanthavenicethings.blogspot.de/2011/07/keep-your-hands-off-of-my-whitespace.html
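The manipulation step inside the use block might look like the following. It is a minimal sketch that assumes the question's page/link/url layout flattened to one level, and it covers only the in-memory update; serialization is left as in the answer above.

```groovy
import groovy.xml.DOMBuilder
import groovy.xml.dom.DOMCategory

// Sample input is an assumption made for illustration
def xml = '<page>\n    <link>\n        <url>/some/old/url</url>\n    </link>\n</page>'
def root = DOMBuilder.parse(new StringReader(xml)).documentElement

use(DOMCategory) {
    // Property access by element name, text() and setValue() are all provided by DOMCategory
    root.link.each { link ->
        link.url.each { url ->
            if (url.text() == '/some/old/url') url.setValue('/a/new/and/improved/url')
        }
    }
}

// Outside the category the plain W3C DOM API still applies
println root.getElementsByTagName('url').item(0).textContent
```

Whitespace-only text nodes between the elements stay in the DOM tree, so whether they survive to the output file depends on the serializer used afterwards.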

Tags: regex, xml, groovy

Alex Wittig
