I have run into a very annoying problem when using jTidy (on Android). jTidy works on every HTML document I have tested it against, except the following:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<!-- Always force latest IE rendering engine & Chrome Frame
Remove this if you use the .htaccess -->
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<title>templates</title>
<meta name="description" content="" />
<meta name="author" content="" />
<meta name="viewport" content="width=device-width; initial-scale=1.0" />
<!-- Replace favicon.ico & apple-touch-icon.png in the root of your domain and delete these references -->
<link rel="shortcut icon" href="/favicon.ico" />
<link rel="apple-touch-icon" href="/apple-touch-icon.png" />
</head>
<body>
<div>
<header>
<h1>Page Heading</h1>
</header>
<nav>
<p><a href="/">Home</a></p>
<p><a href="/contact">Contact</a></p>
</nav>
<div>
</div>
<footer>
<p>© Copyright</p>
</footer>
</div>
</body>
</html>
But after tidying it, jTidy returns nothing: if the String containing the tidied HTML is called result, then result.equals("") is true.
I have noticed something very interesting though: if I remove everything in the body part of the HTML, jTidy works perfectly. Is there something in the <body></body> that jTidy doesn't like?
Here is the Java code I am using:
public String tidy(String sourceHTML) {
    StringReader reader = new StringReader(sourceHTML);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    Tidy tidy = new Tidy();
    tidy.setMakeClean(true);
    tidy.setQuiet(false);
    tidy.setIndentContent(true);
    tidy.setSmartIndent(true);
    tidy.parse(reader, baos);
    try {
        return baos.toString(mEncoding);
    } catch (UnsupportedEncodingException e) {
        return null;
    }
}
Is there something wrong with my Java? Is this an error with jTidy? Is there any way I can make jTidy not do this? (I cannot change the HTML). If this absolutely cannot be fixed, are there any other good HTML Tidiers? Thanks very much!
Try this:
tidy.setForceOutput(true);
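There are probably parse errors, and by default jTidy produces no output when it encounters errors. A likely cause here is the HTML5 elements in the body (<header>, <nav>, <footer>), which jTidy may not recognize; that would also explain why removing the body contents makes it work.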
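For context, here is a minimal sketch of how the question's method might look with forced output enabled. It assumes a JTidy release (such as r938) that provides setForceOutput, getParseErrors, and the parse(Reader, Writer) overload; the class name ForceOutputExample is made up for illustration. Writing to a StringWriter instead of a ByteArrayOutputStream is a change from the original code that sidesteps the encoding step.

```java
import java.io.StringReader;
import java.io.StringWriter;

import org.w3c.tidy.Tidy;

// Hypothetical class name, used only for this illustration.
public class ForceOutputExample {

    // Tidies the given HTML and returns the result even if jTidy reported errors.
    public static String tidy(String sourceHtml) {
        Tidy tidy = new Tidy();
        tidy.setMakeClean(true);
        tidy.setQuiet(false);
        tidy.setIndentContent(true);
        tidy.setSmartIndent(true);
        // Emit the document even when parse errors were found.
        tidy.setForceOutput(true);

        // Assumption: the parse(Reader, Writer) overload is available (JTidy r938).
        StringWriter out = new StringWriter();
        tidy.parse(new StringReader(sourceHtml), out);

        // Report how many errors jTidy counted, so forced output is not silent.
        if (tidy.getParseErrors() > 0) {
            System.err.println("jTidy reported " + tidy.getParseErrors()
                    + " error(s); output was forced");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>templates</title></head>"
                + "<body><header><h1>Page Heading</h1></header></body></html>";
        System.out.println(tidy(html));
    }
}
```

Note that forcing output does not make jTidy understand elements it rejected, so it is worth checking the result for dropped tags.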
Check out jsoup; it's my recommendation for any kind of Java HTML processing (I've used HtmlCleaner too, but then switched to jsoup).
Cleaning HTML with jsoup:
final String yourHtml = ...
String output = Jsoup.clean(yourHtml, Whitelist.relaxed());
That's all! (In jsoup 1.14.2 and later, Whitelist is named Safelist.)
Or, if you want to change / remove / parse / ... something:
Document doc = Jsoup.parse(<file/string/website>, null);
String output = doc.toString();
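As a rough sketch of the parse-and-reserialize route for the string case, the following assumes jsoup is on the classpath; the class name JsoupTidyExample and the sample markup are made up for illustration. It parses HTML with an unclosed tag and prints it back with indentation.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class name, used only for this illustration.
public class JsoupTidyExample {

    // Parses possibly malformed HTML and returns it re-serialized with indentation.
    public static String tidy(String sourceHtml) {
        Document doc = Jsoup.parse(sourceHtml);
        // Pretty-printing is jsoup's default; set explicitly here along with the indent width.
        doc.outputSettings().prettyPrint(true).indentAmount(2);
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        // The <p> element is deliberately left unclosed.
        String html = "<html><body><header><h1>Page Heading</h1></header>"
                + "<nav><p><a href=\"/\">Home</a></nav></body></html>";
        System.out.println(tidy(html));
    }
}
```

Unlike Jsoup.clean, this keeps the whole document, including the head and HTML5 elements such as header and nav.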