Is there a way of getting jsoup to clean a string with HTML in it by escaping the unwanted HTML rather than removing it completely? My example:
String dirty = "This is <b>REALLY</b> dirty code from <a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
String clean = Jsoup.clean(dirty, new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));
This gives a "clean" string of:
This is REALLY dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>
What I want is for the "clean" string to be:
This is &lt;b&gt;REALLY&lt;/b&gt; dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>
Assuming you are parsing Strings rather than whole HTML documents (as in your question), this method will work:
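public String escapeHtml(String source) {
    Document doc = Jsoup.parseBodyFragment(source);
    // Replace each <b> element with a text node holding its markup;
    // jsoup escapes text nodes when the document is serialized.
    Elements elements = doc.select("b");
    for (Element element : elements) {
        element.replaceWith(new TextNode(element.toString(), ""));
    }
    // Clean the result, keeping only <a> tags and the listed attributes.
    return Jsoup.clean(doc.body().toString(), new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));
}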
You could turn the hard-coded "b" into a method argument so that callers can pass in the list of tags they want escaped.
The associated passing JUnit test:
@Test
public void testHtmlEscaping() throws Exception {
    String source = "This is <b>REALLY</b> dirty code from <a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
    String expected = "This is &lt;b&gt;REALLY&lt;/b&gt; dirty code from \n<a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
    String transformed = transformer.escapeHtml(source);
    // JUnit's assertEquals takes the expected value first.
    assertEquals(expected, transformed);
}
Note that I added a line break ("\n") before the "a" tag in the test's "expected" String because jsoup pretty-prints its output.
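As a rough illustration of the idea of passing in the tags to escape, here is a minimal sketch in plain Java. Note that it swaps in a different technique: a regex pre-pass over the string, not the jsoup DOM replacement used in the answer. TagEscaper and escapeTags are hypothetical names, not part of jsoup, and the regex assumes reasonably well-formed tags (a ">" inside an attribute value would confuse it).

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of jsoup.
final class TagEscaper {

    private TagEscaper() {
    }

    // Escapes the opening and closing tags whose names appear in tagNames
    // and leaves all other markup in the string untouched.
    static String escapeTags(String source, List<String> tagNames) {
        if (tagNames.isEmpty()) {
            return source; // nothing to escape
        }
        // Build an alternation such as "b|i", quoting each name defensively.
        String names = tagNames.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Matches <b>, </b>, <b class="x"> and so on, ignoring case.
        // The lookahead stops "b" from also matching "br" or "body".
        Pattern tag = Pattern.compile(
                "</?(?:" + names + ")(?=[\\s/>])[^>]*>",
                Pattern.CASE_INSENSITIVE);

        Matcher matcher = tag.matcher(source);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Escape "&" first so the entities added next are not re-escaped.
            String escaped = matcher.group()
                    .replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
            matcher.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

For example, TagEscaper.escapeTags(dirty, Arrays.asList("b")) on the string from the question should yield This is &lt;b&gt;REALLY&lt;/b&gt; dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>, which can then be handed to Jsoup.clean with the same Whitelist as in the answer.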