
Proper usage of JTidy to purify HTML

I am trying to use JTidy (jtidy-r938.jar) to sanitize an input HTML string, but I can't seem to get the default settings right. Strings such as "hello world" often end up as "helloworld" after tidying. Here is what I'm doing; any pointers would be really appreciated:

Assume that rawHtml is the String containing the input (real world) HTML. This is what I'm doing:

        Tidy tidy = new Tidy();
        tidy.setPrintBodyOnly(true);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        PrintStream ps = new PrintStream(baos);

        tidy.parse(new StringReader(rawHtml), ps);
        return baos.toString("UTF8");   

First off, does anything look fundamentally wrong with the above code? I seem to be getting weird results with this.

For example, consider the following input:

<p class="MsoNormal" style="text-autospace:none;"><font color="black"><span style="color:black;">???</span></font><b><font color="#7f0055"><span style="color:#7f0055;font-weight:bold;">private</span></font></b><font color="black"><span style="color:black;"> String parseDescription</span></font><font>

The output is:

<p class="MsoNormal" style="text-autospace:none;"><font color= "black"><span style="color:black;">&nbsp;&nbsp;&nbsp;</span></font> <b><font color="#7F0055"><span style= "color:#7f0055;font-weight:bold;">private</span></font></b><font color="black"><span style="color:black;">String parseDescription</span></font></p>

So,

"private String parseDescription" becomes "privateString parseDescription"

Thanks in advance!

ragebiswas asked Mar 30 '10 16:03


2 Answers

Have a look at how JTidy is configured:

StringWriter writer = new StringWriter();
tidy.getConfiguration().printConfigOptions(writer, true);
System.out.println(writer.toString());

That may make it clear what is causing the problem.

What exactly is weird? A small example of the actual and the expected output would help.
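One more thing the configuration dump can help rule out: the question's snippet decodes the buffer with `baos.toString("UTF8")`, which is only correct if the bytes written into the buffer are compatible with UTF-8. Which encoding JTidy writes depends on its configuration, so this is a possibility to check rather than a diagnosis, and it would not by itself explain a missing space. The sketch below uses only the standard library and no JTidy call. The names `EncodingRoundTrip` and `roundTrip` are made up for illustration. It shows that text survives when the writing and reading charsets match, and that a non-breaking space is corrupted when they do not.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of JTidy.
class EncodingRoundTrip {
    /** Encodes text into a byte buffer with one charset, then decodes it with another. */
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] encoded = text.getBytes(writeAs);
        baos.write(encoded, 0, encoded.length);
        return new String(baos.toByteArray(), readAs);
    }

    public static void main(String[] args) {
        String sample = "a\u00A0b"; // contains a non-breaking space

        // Same charset on both sides: the text survives unchanged.
        System.out.println(sample.equals(
                roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_8))); // prints "true"

        // Written as ISO-8859-1 but read as UTF-8: the non-breaking space is corrupted.
        System.out.println(sample.equals(
                roundTrip(sample, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8))); // prints "false"
    }
}
```

If the dump shows an output encoding other than the one used when converting the bytes back to a String, aligning the two is a cheap first step before looking for a bug elsewhere.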

Verhagen answered Sep 21 '22 18:09


Well, this seems to be a bug in JTidy. For the exact file that causes the problem, see the tracker entry:

http://sourceforge.net/tracker/?func=detail&aid=2985849&group_id=13153&atid=113153

Thanks for all the help folks!

ragebiswas answered Sep 21 '22 18:09