 

jTidy returns nothing after tidying HTML

I have come across a very annoying problem when using jTidy (on Android). I have found jTidy works on every HTML Document I have tested it against, except the following:

    <!DOCTYPE html>
      <html lang="en">
       <head>
        <meta charset="utf-8" />

         <!-- Always force latest IE rendering engine & Chrome Frame 
              Remove this if you use the .htaccess -->
         <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />

         <title>templates</title>
         <meta name="description" content="" />
         <meta name="author" content="" />

         <meta name="viewport" content="width=device-width; initial-scale=1.0" />

         <!-- Replace favicon.ico & apple-touch-icon.png in the root of your domain and delete these references -->
      <link rel="shortcut icon" href="/favicon.ico" />
      <link rel="apple-touch-icon" href="/apple-touch-icon.png" />
   </head>

 <body>
   <div>
     <header>
       <h1>Page Heading</h1>
     </header>
     <nav>
       <p><a href="/">Home</a></p>
       <p><a href="/contact">Contact</a></p>
     </nav>

     <div>

     </div>

     <footer>
      <p>&copy; Copyright</p>
     </footer>
   </div>
 </body>
 </html>

But after tidying it, jTidy returns nothing: if the String containing the tidied HTML is called result, then result.equals("") is true.

I have noticed something very interesting though: if I remove everything in the body part of the HTML jTidy works perfectly. Is there something in the <body></body> jTidy doesn't like?

Here is the Java code I am using:

 public String tidy(String sourceHTML) {
   StringReader reader = new StringReader(sourceHTML);

   ByteArrayOutputStream baos = new ByteArrayOutputStream();
   Tidy tidy = new Tidy();
   tidy.setMakeClean(true);
   tidy.setQuiet(false);
   tidy.setIndentContent(true);
   tidy.setSmartIndent(true);

   tidy.parse(reader, baos);

   try {
     return baos.toString(mEncoding);
   } catch (UnsupportedEncodingException e) {
     return null;
   }
 }

Is there something wrong with my Java? Is this an error with jTidy? Is there any way I can make jTidy not do this? (I cannot change the HTML). If this absolutely cannot be fixed, are there any other good HTML Tidiers? Thanks very much!

Henry Thompson asked Jan 16 '12 19:01


2 Answers

Try this:

tidy.setForceOutput(true);

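There are probably parse errors, and by default jTidy suppresses its output when it finds any. The <header>, <nav> and <footer> elements in your body are HTML5 tags that jTidy most likely does not recognize, which would explain why the document tidies fine once the body is emptied.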
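To confirm whether parse errors are the cause, it can help to capture jTidy's diagnostics alongside the forced output. The following is a minimal sketch, assuming the r938-style jTidy API (setErrout, getParseErrors, and the parse(Reader, Writer) overload); the class name TidyWithDiagnostics is made up for illustration, and an Android port of jTidy may differ slightly.

```java
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;

import org.w3c.tidy.Tidy;

// Hypothetical helper class, not part of jTidy itself.
public class TidyWithDiagnostics {

    /** Tidies the given HTML and prints jTidy's diagnostics if errors were found. */
    public static String tidy(String sourceHtml) {
        Tidy tidy = new Tidy();
        tidy.setMakeClean(true);
        tidy.setQuiet(false);
        tidy.setIndentContent(true);
        tidy.setSmartIndent(true);
        // Emit output even when the parser reports errors.
        tidy.setForceOutput(true);

        // Collect jTidy's messages instead of letting them go to the default stream.
        StringWriter diagnostics = new StringWriter();
        tidy.setErrout(new PrintWriter(diagnostics, true));

        // Writing to a Writer avoids the byte-to-String encoding round trip.
        StringWriter output = new StringWriter();
        tidy.parse(new StringReader(sourceHtml), output);

        if (tidy.getParseErrors() > 0) {
            System.err.println("jTidy reported " + tidy.getParseErrors() + " error(s):");
            System.err.println(diagnostics);
        }
        return output.toString();
    }
}
```

If the diagnostics mention unrecognized elements such as header or nav, that points to the HTML5 tags rather than to the Java code.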

linuxdan answered Sep 30 '22 18:09


Check out Jsoup; it's my recommendation for any kind of HTML processing in Java (I've used HtmlCleaner too, but then switched to Jsoup).

Cleaning HTML with Jsoup:

final String yourHtml = ...

String output = Jsoup.clean(yourHtml, Whitelist.relaxed());

That's all!

Or, if you want to change, remove, or parse something:

Document doc = Jsoup.parse(<file/string/website>, null);

String output = doc.toString();
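
For the question's goal of re-indenting a whole page, the parse route is probably the closer fit, since Jsoup.clean is a sanitizer that drops tags outside the whitelist. A minimal sketch, assuming a reasonably recent jsoup on the classpath; the class name JsoupTidy and the indent width of 2 are illustrative choices, not anything from the original answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, not part of jsoup itself.
public class JsoupTidy {

    /** Parses the HTML with jsoup and returns it re-serialized with indentation. */
    public static String tidy(String sourceHtml) {
        // jsoup's parser is lenient and understands HTML5 elements.
        Document doc = Jsoup.parse(sourceHtml);

        // Pretty printing is on by default; set it explicitly for clarity.
        doc.outputSettings()
           .prettyPrint(true)
           .indentAmount(2);

        return doc.outerHtml();
    }
}
```

Calling JsoupTidy.tidy(html) on the page from the question should keep the header, nav and footer elements in the output.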
ollo answered Sep 30 '22 18:09