I have some HTML code that is the result of an XSLT transformation (XML->HTML).
I want to run another XSLT transformation on the result HTML. (HTML->HTML)
My problem is that the first transformation may return unclosed tags like "<img>", which means that I can't parse the resulting HTML with DocumentBuilder, because it uses a SAX parser and my HTML file is not well-formed XML in all cases. (I get an exception saying that the XY tag must be closed.)
I guess there are two solutions:
1. Fix the result HTML by closing the unclosed tags.
2. Use some kind of HTML parser to get a valid org.w3c.dom.Document and skip XML parsers like SAX.
I would really like to use mostly the same method I used for the first transformation, so I would prefer one of the solutions above. The problem is that I can't find any obvious third-party JARs that can help (though I looked). So basically I would like to know what my options are here. Are there any solutions to this problem?
Any help would be greatly appreciated.
What you need is Jsoup: Java HTML Parser. It has functionality to output tidy HTML.
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
String html = "<p>The recurrence, in close succession <ul><li>list item 1</li><li>list item 2</li></ul> second part of thisssss";
String clean = Jsoup.clean(html, Whitelist.relaxed());
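You can also use other Whitelist presets, such as Whitelist.basic(). Keep in mind that Jsoup.clean also removes any tags and attributes that are not on the whitelist, and that newer jsoup releases name this class Safelist.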
You could use TagSoup to ensure that all of the documents are well-formed.
...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.
TagSoup is designed for people who have to process this stuff using some semblance of a rational application design.
By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
If you are using Saxon, you can make TagSoup your parser by adding the following option:
...you can use the standard Saxon -x org.ccil.cowan.tagsoup.Parser option, after making sure that TagSoup is on your Java classpath.
I have used this to parse and transform HTML documents in a single pass and have found that it works great. It will read the document as a well-formed XHTML document available to be manipulated and transformed through XML tools.
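If you are not running Saxon from the command line, the same idea can be sketched with plain JAXP: hand the Transformer a SAXSource backed by whatever XMLReader you choose. This is only a minimal sketch, not a tested TagSoup setup. The class and method names (SecondPassTransform, transform, strictJdkReader) are made up for illustration, and the reader is a parameter so the code compiles with nothing but the JDK.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper: runs a stylesheet over markup read through a caller-supplied SAX reader.
class SecondPassTransform {

    // 'reader' decides how the markup is parsed. With TagSoup on the classpath
    // (an assumption, not needed to compile this sketch) the call would be:
    //   transform(new org.ccil.cowan.tagsoup.Parser(), html, stylesheet)
    static String transform(XMLReader reader, String markup, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            // SAXSource pairs the chosen reader with the input to be parsed.
            SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
            StringWriter out = new StringWriter();
            transformer.transform(source, new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Second transformation failed", e);
        }
    }

    // The JDK's own strict XML reader: fine for well-formed input,
    // but it rejects unclosed tags such as <img>.
    static XMLReader strictJdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("No SAX parser available", e);
        }
    }
}
```

One thing to watch for: as far as I know, TagSoup reports elements in the XHTML namespace by default, so the match patterns in the second stylesheet may need a prefix bound to that namespace.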
Also, Taggle, a TagSoup in C++, available now