Fix unclosed tags in html or parse with HTML parser for XSLT transformation

Question

I have some HTML that is the result of an XSLT transformation (XML -> HTML).

I want to run another XSLT transformation on the result HTML. (HTML->HTML)

My problem is that the first transformation may produce unclosed tags such as "<img>", so I can't parse the resulting HTML with DocumentBuilder: it uses a SAX parser, and my HTML file is not always well-formed XML. (I get an exception saying that the XY tag must be closed.)
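The failure is easy to reproduce with the JDK's own parser. Here is a minimal sketch using only standard JAXP classes. The class name WellFormednessCheck and the sample markup are made up for illustration and are not taken from the real project.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class WellFormednessCheck {

    /** Returns true if the markup parses as XML, false if the XML parser rejects it. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler, which prints fatal errors to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for an unclosed <img>, among other well-formedness errors.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p><img src=\"a.png\"/></p>")); // true
        System.out.println(isWellFormed("<p><img src=\"a.png\"></p>"));  // false
    }
}
```

The second call fails because the XML parser expects a matching end tag for img, which is exactly what the HTML output of the first transformation omits.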

I can see two possible solutions:

1. Fix the resulting HTML by closing the unclosed tags.
2. Use some kind of HTML parser to get a valid org.w3c.dom.Document and skip XML parsers like SAX.

I would really like to use mostly the same method I used for the first transformation, so I would prefer one of the solutions above. The problem is that I can't find any obvious third-party JARs that can help (though I looked). So what are my options here? Are there any solutions to this problem?

Any help would be greatly appreciated.

kaysush · Accepted Answer

What you need is jsoup, a Java HTML parser. It can output tidy HTML.

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
String html = "<p>The recurrence, in close succession <ul><li>list item 1</li><li>list item 2</li></ul> second part of thisssss";
String clean = Jsoup.clean(html, Whitelist.relaxed());

You can also use other Whitelist presets, such as Whitelist.basic(). (In recent jsoup releases the class is named Safelist.)
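For the XSLT use case, the output has to be well-formed XML or a DOM tree, not just sanitized HTML. As far as I can tell, Jsoup.clean returns an HTML fragment in HTML syntax, so void elements such as img may still come out unclosed. The sketch below shows two ways around that, assuming the jsoup jar is on the classpath. The class name JsoupBridge and its method names are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.jsoup.nodes.Entities;

// Hypothetical helper, for illustration only. Requires jsoup on the classpath.
class JsoupBridge {

    /** Parses lenient HTML and re-serializes it with XML syntax, so void tags are closed. */
    static String toXhtml(String html) {
        org.jsoup.nodes.Document doc = Jsoup.parse(html);
        doc.outputSettings()
           .syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml)
           // Use XML-safe escapes rather than HTML named entities.
           .escapeMode(Entities.EscapeMode.xhtml);
        return doc.html();
    }

    /** Converts the jsoup tree directly to a W3C DOM, skipping the XML parser entirely. */
    static org.w3c.dom.Document toW3cDom(String html) {
        return new W3CDom().fromJsoup(Jsoup.parse(html));
    }
}
```

The string from toXhtml can be fed to DocumentBuilder as before, while the Document from toW3cDom can be wrapped in a javax.xml.transform.dom.DOMSource and passed straight to the second transformation.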

Mads Hansen · Answer

TagSoup - Just Keep On Truckin'

You could use TagSoup to ensure that all of the documents are well-formed.

...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.

TagSoup is designed for people who have to process this stuff using some semblance of a rational application design.

By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

If you are using Saxon, you can make TagSoup your parser by adding the following option:

...you can use the standard Saxon -x org.ccil.cowan.tagsoup.Parser option, after making sure that TagSoup is on your Java classpath.

I have used this to parse and transform HTML documents in a single pass and have found that it works great. It will read the document as a well-formed XHTML document available to be manipulated and transformed through XML tools.
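The same single-pass idea can be used from Java code through JAXP, because TagSoup's parser is an ordinary SAX XMLReader. Below is a rough sketch, assuming the TagSoup jar is on the classpath. HtmlToDom and its method names are made up for illustration. The parser is loaded by class name only so that the sketch compiles without TagSoup; calling new org.ccil.cowan.tagsoup.Parser() directly is equivalent.

```java
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper, for illustration only.
class HtmlToDom {

    /** Loads TagSoup's parser by name; fails if the TagSoup jar is not on the classpath. */
    static XMLReader tagSoupReader() {
        try {
            return (XMLReader) Class.forName("org.ccil.cowan.tagsoup.Parser")
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("TagSoup is not on the classpath", e);
        }
    }

    /** The JDK's strict XML reader, included only for comparison with TagSoup. */
    static XMLReader strictReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Feeds markup through the given SAX reader and collects the events into a DOM tree. */
    static Document toDom(XMLReader reader, String markup) {
        try {
            // A transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            DOMResult result = new DOMResult();
            SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
            identity.transform(source, result);
            return (Document) result.getNode();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Usage would look like Document doc = HtmlToDom.toDom(HtmlToDom.tagSoupReader(), html). The same SAXSource can also be handed directly to a Transformer built from the second stylesheet instead of the identity transformer. As far as I know, TagSoup reports elements in the XHTML namespace by default, so match patterns in the second stylesheet may need a namespace prefix.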

There is also Taggle, a C++ port of TagSoup.

Tags: java, html, parsing, tags, xslt

Peter Jaloveczki
