XMLParser is eating my whitespace

Question

I am losing significant whitespace from a wiki page I am parsing and I'm thinking it's because of the parser. I have this in my Groovy script:

@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2' )
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
slurper.keepWhitespace = true
inputStream.withStream{ doc = slurper.parse(it) 
println "originalContent = " + doc.'**'.find{ it.@id == 'editpageform' }.'**'.find { it.@name=='originalContent'}.@value
}

Here inputStream is initialized from a URL GET request for the edit view of a Confluence wiki page. Later on, in the withStream block, I do this:

println "originalContent = " + doc.'**'.find{ it.@id == 'editpageform' }.'**'.find { it.@name=='originalContent'}.@value

I notice that all the original content of the page has been stripped of its newlines. I originally thought it was a server-side thing, but when I made the same request in my browser and viewed the source, I could see newlines in the "originalContent" hidden parameter. Is there an easy way to disable the whitespace normalization and preserve the contents of the field? The above was run against an internal Confluence wiki page but could most likely be reproduced when editing any arbitrary wiki page.

Update: I added a call to "slurper.keepWhitespace = true" above in an attempt to preserve whitespace, but that still doesn't work. I'm thinking this setting is intended for elements and not attributes? Is there a way to easily tweak flags on the underlying Java XMLParser? Is there a specific setting for whitespace in attribute values?

stackmagic · Accepted Answer

I first tried to reproduce this with a Confluence page of my own, but there was no value attribute and no text content in the input node, so I created my own test HTML.

I figured the TagSoup parser would need to be configured to preserve whitespace too. Just setting keepWhitespace on the slurper won't help, because TagSoup ignores whitespace by default.

So I did just that. The TagSoup feature ignorable-whitespace is documented, by the way (search for "whitespace" on the TagSoup page).

Anyway, it doesn't work. As you can see from the example below, whitespace in attributes is preserved, but whitespace in text content is still lost despite setting the extra feature. Maybe this is a bug in TagSoup or in XmlSlurper?

I suggest you take a closer look at your HTML too: is there really a value attribute present?
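One way to check this without any parser in the way is to search the raw response text directly. The following is only a sketch: rawHtml is a made-up sample standing in for the unparsed page source, and the regular expressions assume double-quoted attributes and no '>' character inside the value.

```groovy
// Hypothetical sample; in practice this would be the unparsed response body
// of the edit page, read as plain text before any parser touches it.
String rawHtml = '''<form id="editpageform">
<input type="hidden" name="originalContent" value="line one
line two">
</form>'''

// Locate the <input> tag by its name attribute (assumes no '>' inside the tag).
String inputTag = rawHtml.find(/<input\b[^>]*\bname="originalContent"[^>]*>/)

// Pull out the raw value attribute, if there is one (assumes double quotes).
String rawValue = inputTag?.find(/\bvalue="([^"]*)"/) { whole, v -> v }

boolean hasNewline = rawValue != null && rawValue.contains('\n')

println "input tag found   : ${inputTag != null}"
println "value attribute   : ${rawValue != null}"
println "value has newline : ${hasNewline}"
```

If the newline shows up here but not in the value read through the slurper, the loss happens during parsing rather than on the server.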

@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2' )

String html = """\
<html><head><title>test</title></head><body>
<p>
    <form id="editpageform">
        <p>
            <input name="originalContent" value="         ">         

            </input>
        </p>
    </form>
</p>
</body></html>
"""
// Feed the test HTML to the parser as a stream, as in the question.
def inputStream = new ByteArrayInputStream(html.getBytes())

// Ask TagSoup to pass whitespace in element-only content on to the SAX handler.
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature("http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace", true)

def slurper = new XmlSlurper(parser)
slurper.keepWhitespace = true

inputStream.withStream {
    def doc = slurper.parse(it)
    // Locate the originalContent input inside the edit form.
    def input = doc.'**'.find { it.@id == 'editpageform' }.'**'.find { it.@name == 'originalContent' }
    println "originalContent (name)  = '${input.@name}'"
    println "originalContent (value) = '${input.@value}'"
    println "originalContent (text)  = '${input.text()}'"
}
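
Note that the test HTML above only has spaces in the value attribute, while the question is about lost newlines. For what it's worth, a conforming XML parser is required to turn a literal line break inside an attribute value into a space (attribute-value normalization), whereas a line break written as the character reference &#10; survives. The sketch below shows this with a plain XmlSlurper and the JDK's default parser. It does not involve TagSoup, so whether TagSoup treats newlines in attributes the same way is an assumption I have not verified.

```groovy
// Plain XmlSlurper on top of the JDK's default XML parser (no TagSoup here).
// On Groovy 4+, add: import groovy.xml.XmlSlurper
def slurper = new XmlSlurper()

// Attribute value containing a literal newline character.
def literal = slurper.parseText('<input name="originalContent" value="line one\nline two"/>')

// Same value, but with the newline written as a character reference.
def escaped = slurper.parseText('<input name="originalContent" value="line one&#10;line two"/>')

// The literal newline is normalized to a single space by the XML parser...
assert literal.@value.text() == 'line one line two'

// ...while the character reference keeps the newline intact.
assert escaped.@value.text() == 'line one\nline two'

println "literal newline -> ${literal.@value.text().inspect()}"
println "char reference  -> ${escaped.@value.text().inspect()}"
```

If something similar is happening in your case, keepWhitespace would not help, since that setting concerns text nodes rather than attribute values.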

Tags: java, xml, xml-parsing, groovy

Cliff