
Fix unclosed tags in html or parse with HTML parser for XSLT transformation

I have some HTML that is the result of an XSLT transformation (XML -> HTML).

I want to run another XSLT transformation on the result HTML. (HTML->HTML)

My problem is that the first transformation may return unclosed tags like "<img>", so I can't parse the resulting HTML with DocumentBuilder: it uses a SAX parser, and my HTML file is not always well-formed XML. (I get an exception saying that tag XY must be closed.)
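A minimal reproduction of this failure, as a sketch using only the JDK. The class name UnclosedTagDemo and the sample markup are made up for illustration; the point is only that the same document parses once the img element is closed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question
class UnclosedTagDemo {

    /** Returns true if the JDK XML parser accepts the markup as well-formed XML. */
    public static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXParseException e) {
            // Raised for well-formedness errors such as a missing end tag
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String htmlStyle = "<html><body><img src=\"a.png\"></body></html>";
        String xmlStyle = "<html><body><img src=\"a.png\"/></body></html>";
        System.out.println("unclosed img parses: " + parsesAsXml(htmlStyle));
        System.out.println("closed img parses:   " + parsesAsXml(xmlStyle));
    }
}
```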

I guess there are two solutions.

  1. Either fix the result HTML by closing the unclosed tags.

  2. Use some kind of HTML parser to get a valid org.w3c.dom.Document and skip XML parsers like SAX.

I would really like to reuse mostly the same method I used for the first transformation, so I would prefer one of the solutions above. The problem is that I can't find any obvious third-party JARs that can help (though I looked). So what are my options here? Are there any solutions to this problem?

Any help would be greatly appreciated.

Peter Jaloveczki asked Mar 04 '13 14:03

2 Answers

What you need is jsoup, a Java HTML parser. It can read malformed HTML and output tidied markup.

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
String html = "<p>The recurrence, in close succession <ul><li>list item 1</li><li>list item 2</li></ul> second part of thisssss";
String clean = Jsoup.clean(html, Whitelist.relaxed());

You can also use one of the other predefined Whitelist configurations.
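For the second option in the question (getting an org.w3c.dom.Document without going through an XML parser), jsoup also ships a W3CDom helper. The following is a minimal sketch, assuming a jsoup version that includes org.jsoup.helper.W3CDom is on the classpath; the class name JsoupToDom and its method are made up for illustration. Note that newer jsoup releases rename Whitelist to Safelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

// Hypothetical helper class, not part of jsoup itself
class JsoupToDom {

    /** Parses possibly malformed HTML and converts the result to a W3C DOM tree. */
    public static org.w3c.dom.Document toW3cDocument(String html) {
        // jsoup's HTML parser tolerates unclosed tags such as <img>
        org.jsoup.nodes.Document soup = Jsoup.parse(html);
        // W3CDom copies the jsoup tree into an org.w3c.dom.Document
        return new W3CDom().fromJsoup(soup);
    }

    public static void main(String[] args) {
        org.w3c.dom.Document dom = toW3cDocument("<p>Some text<img src=\"a.png\">");
        System.out.println(dom.getElementsByTagName("img").getLength());
    }
}
```

The returned document can be wrapped in a javax.xml.transform.dom.DOMSource and passed to the second transformation. It is worth checking whether the converted elements carry the XHTML namespace in the jsoup version in use, since that affects the match patterns in the stylesheet.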

kaysush answered Nov 14 '22 21:11


TagSoup - Just Keep On Truckin'

You could use TagSoup to ensure that all of the documents are well-formed.

...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.

TagSoup is designed for people who have to process this stuff using some semblance of a rational application design.

By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

If you are using Saxon, you can make TagSoup your parser by adding the following option:

...you can use the standard Saxon -x org.ccil.cowan.tagsoup.Parser option, after making sure that TagSoup is on your Java classpath.

I have used this to parse and transform HTML documents in a single pass and have found that it works great. It will read the document as a well-formed XHTML document available to be manipulated and transformed through XML tools.
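The Saxon option above applies to the command line. From Java code, a comparable setup, sketched here under the assumption that the TagSoup JAR is on the classpath, hands TagSoup's Parser to JAXP as the XMLReader of a SAXSource. The class name TagSoupTransform and its transform method are made up for illustration. TagSoup reports elements in the XHTML namespace by default, so the stylesheet has to match namespaced names (for example h:img, with the prefix h bound to http://www.w3.org/1999/xhtml).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class, not part of TagSoup or Saxon
class TagSoupTransform {

    /** Runs the given XSLT over HTML that may contain unclosed tags. */
    public static String transform(String html, String xslt) throws Exception {
        // TagSoup's Parser implements the SAX XMLReader interface
        XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(html)));

        // Uses whichever JAXP TransformerFactory is configured (the JDK default or Saxon)
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));

        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }
}
```

This keeps the parse and the transformation in a single pass, so the intermediate HTML never has to be repaired or written out as XML first.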

There is also Taggle, a port of TagSoup to C++.

Mads Hansen answered Nov 14 '22 23:11