I have a large number of XML files that contain URLs. I'm writing a Groovy utility to find each URL and replace it with an updated version.
Given example.xml:
<?xml version="1.0" encoding="UTF-8"?>
<page>
    <content>
        <section>
            <link>
                <url>/some/old/url</url>
            </link>
            <link>
                <url>/some/old/url</url>
            </link>
        </section>
        <section>
            <link>
                <url>
                    /a/different/old/url?with=specialChars&amp;escaped=true
                </url>
            </link>
        </section>
    </content>
</page>
Once the script has run, example.xml should contain:
<?xml version="1.0" encoding="UTF-8"?>
<page>
    <content>
        <section>
            <link>
                <url>/a/new/and/improved/url</url>
            </link>
            <link>
                <url>/a/new/and/improved/url</url>
            </link>
        </section>
        <section>
            <link>
                <url>
                    /a/different/new/and/improved/url?with=specialChars&amp;stillEscaped=true
                </url>
            </link>
        </section>
    </content>
</page>
This is easy to do using Groovy's excellent XML support, except that I want to change the URLs and nothing else about the file.
By that I mean:
So far, after trying many combinations of XmlParser, DOMBuilder, XmlNodePrinter, XmlUtil.serialize(), and so on, I've landed on reading each file line by line and applying an ugly hybrid of the XML utilities and regular expressions.
Reading and writing each file:
files.each { File file ->
    def text = file.text  // read once, before the file is overwritten
    def lineEnding = text.contains('\r\n') ? '\r\n' : '\n'
    def newLineAtEof = text.endsWith(lineEnding)
    def lines = file.readLines()
    file.withWriter { w ->
        lines.eachWithIndex { line, index ->
            line = update(line)
            w.write(line)
            // line ending after every line except the last, unless the file ended with one
            if (index < lines.size() - 1) w.write(lineEnding)
            else if (newLineAtEof) w.write(lineEnding)
        }
    }
}
Searching for and updating URLs within a line:
def matcher = (line =~ urlTagRegexp) // matches a <url> element and its contents
matcher.each { groups ->
    def urlNode = new XmlParser().parseText(line)
    def url = urlNode.text()
    def newUrl = translate(url)
    if (newUrl) {
        urlNode.value = newUrl
        def replacement = nodeToString(urlNode)
        line = matcher.replaceAll(replacement)
    }
}
def nodeToString(node) {
    def writer = new StringWriter()
    writer.withPrintWriter { printWriter ->
        def printer = new XmlNodePrinter(printWriter)
        printer.preserveWhitespace = true
        printer.print(node)
    }
    // XmlNodePrinter emits line breaks; strip them so the result fits on one line
    writer.toString().replaceAll(/[\r\n]/, '')
}
This mostly works, except it can't handle a tag split over multiple lines, and messing with newlines when writing the files back out is cumbersome.
I'm new to Groovy, but I feel like there must be a groovier way of doing this.
I just created a gist at https://gist.github.com/akhikhl/8070808 to demonstrate how this transformation can be done with Groovy and JDOM2.
Important notes:
Update
The following code:
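import org.jdom2.output.Format
import org.jdom2.output.LineSeparator
import org.jdom2.output.XMLOutputter

// doc is the org.jdom2.Document that was parsed and modified earlier
new XMLOutputter().with {
    format = Format.getRawFormat()
    format.setLineSeparator(LineSeparator.NONE)
    output(doc, System.out)
}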
solves the problem of preserving whitespace and line separators. Format.getRawFormat() constructs a format object that preserves whitespace, and LineSeparator.NONE tells that format object not to convert line separators.
The gist mentioned above contains this new code as well.
There is also a solution that needs no third-party library:
def xml = file.text
def document = groovy.xml.DOMBuilder.parse(new StringReader(xml))
def root = document.documentElement
use(groovy.xml.dom.DOMCategory) {
    // manipulate the XML here, e.g. root.someElement?.each { it.value = 'new value' }
}
def result = groovy.xml.dom.DOMUtil.serialize(root)
file.withWriter { w ->
    w.write(result)
}
Taken from http://jonathan-whywecanthavenicethings.blogspot.de/2011/07/keep-your-hands-off-of-my-whitespace.html
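To make the DOM approach above concrete for the <url> case from the question, here is a minimal sketch. It is an illustration under assumptions, not the code from the linked post: translate is a hypothetical closure that maps an old URL to a new one (or returns null), updateUrls is a made-up helper name, and the result is serialized with the JDK's javax.xml.transform identity transformer rather than DOMUtil. Whitespace inside text nodes should survive, but the serializer may still rewrite details such as the XML declaration, so compare the output against the original before overwriting any files.

```groovy
import groovy.xml.DOMBuilder
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult

// Hypothetical lookup table: old URL -> new URL; yields null for unknown URLs.
def translate = { String url ->
    ['/some/old/url': '/a/new/and/improved/url'][url]
}

// Hypothetical helper: rewrites the text of every <url> element in the given XML string.
String updateUrls(String xml, Closure translate) {
    def document = DOMBuilder.parse(new StringReader(xml))
    def urlNodes = document.getElementsByTagName('url')
    for (int i = 0; i < urlNodes.length; i++) {
        def node = urlNodes.item(i)
        String text = node.textContent
        String oldUrl = text.trim()
        String newUrl = translate.call(oldUrl)
        if (newUrl) {
            // Swap only the URL itself so whitespace around it inside <url> is kept.
            node.textContent = text.replace(oldUrl, newUrl)
        }
    }
    // Identity transform: writes the DOM back out without asking for re-indentation.
    def transformer = TransformerFactory.newInstance().newTransformer()
    def writer = new StringWriter()
    transformer.transform(new DOMSource(document), new StreamResult(writer))
    writer.toString()
}

println updateUrls('<page><url>  /some/old/url </url></page>', translate)
```

Because the new value is assigned through textContent, special characters such as & are escaped again by the serializer, so translate works with unescaped URLs.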