Extracting parts of HTML with Groovy

Question

I need to extract part of the HTML from a given HTML page. So far, I use XmlSlurper with TagSoup to parse the page and then try to serialize the part I need with StreamingMarkupBuilder:

import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def dom = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)

However, the result I get is

<html:body xmlns:html='http://www.w3.org/1999/xhtml'>a <html:b>test</html:b></html:body>

The content is what I want, but I would like to get it without the html namespace prefix and the xmlns declaration.

How do I avoid the namespace?

ataylor · Accepted Answer

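By default, TagSoup reports every element in the XHTML namespace, which is why StreamingMarkupBuilder emits the html: prefix. Turn off the namespaces feature on the TagSoup parser before passing it to XmlSlurper. Example: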

import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature, false)
def dom = new XmlSlurper(parser).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)
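
The same fix can be wrapped in a small helper and checked with assertions. This is only a sketch under a few assumptions that are not in the original answer: Groovy 3 or later (where XmlSlurper lives in groovy.xml), TagSoup 1.2.1 fetched through Grape, and a helper name, bodyMarkup, that is made up for illustration.

```groovy
// Sketch only: assumes Groovy 3+ and TagSoup 1.2.1 resolved via Grape.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
import groovy.xml.StreamingMarkupBuilder
import groovy.xml.XmlSlurper   // on Groovy 2.x drop this import (groovy.util.XmlSlurper is a default import there)

// Hypothetical helper, not part of any library: returns the <body> markup as a String.
String bodyMarkup(String html) {
    def parser = new Parser()
    // Same switch as in the answer: stop TagSoup from reporting the XHTML namespace.
    parser.setFeature(Parser.namespacesFeature, false)
    def dom = new XmlSlurper(parser).parseText(html)
    // bindNode returns a Writable; toString() renders it to markup.
    new StreamingMarkupBuilder().bindNode(dom.body).toString()
}

def result = bodyMarkup("<html><body>a <b>test</b></body></html>")
assert !result.contains('xmlns')   // no namespace declaration
assert !result.contains('html:')   // no namespace prefix
println result
```

Returning a String rather than printing directly makes it easy to assert on the output or to pass the fragment on to other code.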
