I'm cleaning some text of unwanted HTML tags (such as <script>) by using
    String clean = Jsoup.clean(someInput, Whitelist.basicWithImages());
The problem is that it replaces, for instance, å with &aring; (which causes trouble for me since it's not "pure XML").
For example,
    Jsoup.clean("hello å <script></script> world", Whitelist.basicWithImages())
yields
    "hello &aring; world"
but I would like
    "hello å world"
Is there a simple way to achieve this? (I.e., simpler than converting &aring; back to å in the result.)
You can configure Jsoup's escape mode: using EscapeMode.xhtml gives you output without named HTML entities such as &aring;.
Here's a complete snippet that accepts str as input, and cleans it using Whitelist.simpleText():
    // Parse str into a Document
    Document doc = Jsoup.parse(str);

    // Clean the document.
    doc = new Cleaner(Whitelist.simpleText()).clean(doc);

    // Adjust escape mode
    doc.outputSettings().escapeMode(EscapeMode.xhtml);

    // Get back the string of the body.
    str = doc.body().html();
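To see why the named entity matters for anything that consumes the result as XML, here is a minimal sketch that uses only the JDK's built-in parser, not Jsoup. XML predefines just five entities (&lt;, &gt;, &amp;, &apos; and &quot;), so &aring; is rejected unless a DTD declares it. The class name XmlEntityCheck, the helper isWellFormedXml, and the <root> wrapper element are assumptions made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Jsoup.
class XmlEntityCheck {

    // Returns true if the fragment parses as XML once wrapped in a
    // (hypothetical) <root> element, false if the parser rejects it.
    static boolean isWellFormedXml(String fragment) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(
                    new StringReader("<root>" + fragment + "</root>")));
            return true;
        } catch (SAXException e) {
            // Not well-formed, e.g. a reference to an undeclared entity.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Named HTML entity: undeclared in plain XML.
        System.out.println(isWellFormedXml("hello &aring; world"));
        // The literal character (\u00e5 is å): fine in XML.
        System.out.println(isWellFormedXml("hello \u00e5 world"));
    }
}
```

Running a check like this on the cleaned string is one way to confirm that the escape mode setting has taken effect before the text is passed on to an XML consumer.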