
XMLParser is eating my whitespace

I am losing significant whitespace from a wiki page I am parsing, and I think the parser is the cause. I have this in my Groovy script:

@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
slurper.keepWhitespace = true
inputStream.withStream {
    doc = slurper.parse(it)
    println "originalContent = " + doc.'**'.find { it.@id == 'editpageform' }.'**'.find { it.@name == 'originalContent' }.@value
}

Here inputStream is initialized from a GET request to the edit URL of a Confluence wiki page. Later on, inside the withStream block, I do this:

println "originalContent = " + doc.'**'.find{ it.@id == 'editpageform' }.'**'.find { it.@name=='originalContent'}.@value

I notice that all the original content of the page is stripped of its newlines. I originally thought it was a server-side issue, but when I made the same request in my browser and viewed the source, I could see newlines in the "originalContent" hidden parameter. Is there an easy way to disable the whitespace normalization and preserve the contents of the field? The above was run against an internal Confluence wiki page, but it could most likely be reproduced when editing any arbitrary wiki page.

Update: above, I added a call to "slurper.keepWhitespace = true" in an attempt to preserve whitespace, but that still doesn't work. I'm thinking this setting is intended for elements and not attributes? Is there a way to easily tweak flags on the underlying Java XML parser? Is there a specific setting for whitespace in attribute values?

Cliff asked Nov 14 '22 04:11


1 Answer

I first tried to reproduce this with a Confluence page of my own, but there was no value attribute and no text content in the input node, so I created my own test HTML.

I figured the TagSoup parser would need to be configured to preserve whitespace too. Setting keepWhitespace on the slurper alone won't help, because the parser's default is to ignore whitespace.

So I did just that. The TagSoup feature ignorable-whitespace is documented, by the way (search for "whitespace" on the page).

Anyway, it doesn't work. As the example below shows, whitespace in attribute values is preserved, but whitespace in text content is not, despite setting the extra feature. Maybe this is a bug in TagSoup or in XmlSlurper?

I suggest you take a closer look at your HTML too: is there really a value attribute present?

@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2' )

String html = """\
<html><head><title>test</title></head><body>
<p>
    <form id="editpageform">
        <p>
            <input name="originalContent" value="         ">         

            </input>
        </p>
    </form>
</p>
</body></html>
"""
def inputStream = new ByteArrayInputStream(html.getBytes())

def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature("http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace", true)

def slurper = new XmlSlurper(parser)
slurper.keepWhitespace = true

inputStream.withStream {
    doc = slurper.parse(it)
    // Locate the <input name="originalContent"> element inside the edit form
    def parse = { doc.'**'.find { it.@id == 'editpageform' }.'**'.find { it.@name == 'originalContent' } }
    // Wrap each value in quotes so leading/trailing whitespace is visible
    println "originalContent (name)  = '${parse().@name}'"
    println "originalContent (value) = '${parse().@value}'"
    println "originalContent (text)  = '${parse().text()}'"
}
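
The test HTML above only puts spaces in the value attribute, so it cannot show what happens to newlines there. As a minimal sketch, assuming a plain XmlSlurper backed by the JDK's default XML parser instead of TagSoup, the following illustrates standard XML attribute-value normalization: a literal newline inside an attribute value is replaced by a space, while one written as the character reference &#10; survives. Whether TagSoup treats attribute values the same way is not confirmed here, and the sample markup is made up for illustration.

```groovy
// Assumption: plain XmlSlurper with the default JDK SAX parser, no TagSoup.
// On Groovy 4+, add: import groovy.xml.XmlSlurper

// Hypothetical sample: one literal newline and one &#10; character reference
def xml = '<form><input name="originalContent" value="line1\nline2&#10;line3"/></form>'

def doc = new XmlSlurper().parseText(xml)
String value = doc.input.@value.text()

// Make newlines visible instead of relying on how the console renders them
println "value = '" + value.replace('\n', '\\n') + "'"

// The literal newline became a space; the character reference stayed a newline
assert value == 'line1 line2\nline3'
```

Printing the value with escaped newlines like this, and comparing it against the browser's view-source output, is one way to check whether the parser replaced newlines that the server actually sent.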
stackmagic answered Nov 16 '22 04:11