
Jsoup.clean without adding html entities

I'm stripping unwanted HTML tags (such as <script>) from some text using

String clean = Jsoup.clean(someInput, Whitelist.basicWithImages()); 

The problem is that it replaces, for instance, å with &aring;. This causes trouble for me because &aring; is an HTML entity, so the result is not "pure XML".

For example

Jsoup.clean("hello å <script></script> world", Whitelist.basicWithImages()) 

yields

"hello &aring;  world" 

but I would like

"hello å  world" 

Is there a simple way to achieve this? (I.e. simpler than converting &aring; back to å in the result.)
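To illustrate why &aring; is a problem here, the following is a minimal sketch using only the JDK's built-in XML parser. The class name EntityCheck and the helper isWellFormedXml are hypothetical names chosen for this illustration. It wraps the text in a dummy root element and reports whether a standard XML parser accepts it: without a DTD, only the predefined XML entities (such as &amp; and &lt;) are declared, so &aring; is rejected while the literal character å is fine.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EntityCheck {

    // Hypothetical helper: returns true if the text can be parsed as the
    // content of a well-formed XML element (no DTD is supplied).
    static boolean isWellFormedXml(String text) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader("<r>" + text + "</r>")));
            return true;
        } catch (SAXException e) {
            // Parsing failed, e.g. because of an undeclared entity reference.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Named HTML entity: not declared in plain XML.
        System.out.println(isWellFormedXml("hello &aring; world"));
        // Literal character (\u00e5 is å): accepted.
        System.out.println(isWellFormedXml("hello \u00e5 world"));
        // Predefined XML entities: accepted.
        System.out.println(isWellFormedXml("a &amp; b &lt; c"));
    }
}
```

This is only a well-formedness check under the stated assumptions; it does not depend on jsoup and says nothing about how the cleaned text is consumed downstream.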

asked Dec 30 '11 19:12 by aioobe


1 Answer

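You can configure jsoup's escape mode: with EscapeMode.xhtml, the output is limited to the basic XML entities (&lt;, &gt;, &amp; and &quot;), so a character such as å is written as-is rather than as &aring;.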

Here's a complete snippet that accepts str as input, and cleans it using Whitelist.simpleText():

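import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

// Parse str into a Document
Document doc = Jsoup.parse(str);

// Clean the document.
doc = new Cleaner(Whitelist.simpleText()).clean(doc);

// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);

// Get back the string of the body.
str = doc.body().html();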
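
As a possible shorter variant of the same idea, here is a sketch that passes the output settings straight to Jsoup.clean and uses the question's Whitelist.basicWithImages(). It assumes a jsoup version that provides the Jsoup.clean overload taking a Document.OutputSettings argument and that still ships the Whitelist class (newer releases rename it to Safelist). The class name CleanWithoutEntities is hypothetical, and disabling pretty-printing is an extra choice made here so that no line breaks are added to the output.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.safety.Whitelist;

public class CleanWithoutEntities {

    // Cleans the input and serializes it with XHTML escaping, so only the
    // basic XML entities are used in the result.
    static String clean(String input) {
        Document.OutputSettings settings = new Document.OutputSettings()
                .escapeMode(EscapeMode.xhtml) // no named HTML entities such as &aring;
                .charset("UTF-8")             // a charset that can encode the character directly
                .prettyPrint(false);          // assumption: keep the text on one line
        // The empty string is the base URI; it only matters for relative links.
        return Jsoup.clean(input, "", Whitelist.basicWithImages(), settings);
    }

    public static void main(String[] args) {
        // \u00e5 is the character å from the question's example.
        System.out.println(clean("hello \u00e5 <script></script> world"));
    }
}
```

The output charset matters as much as the escape mode: if the charset cannot represent a character, jsoup has to fall back to an escaped form for it.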
answered Sep 21 '22 19:09 by bmoc