Jsoup.clean without adding html entities

Question

I'm stripping unwanted HTML tags (such as <script>) from some text by using

String clean = Jsoup.clean(someInput, Whitelist.basicWithImages());

The problem is that it replaces, for instance, å with &aring; (which causes trouble for me, since that's not "pure XML").

For example

Jsoup.clean("hello å <script></script> world", Whitelist.basicWithImages())

yields

"hello &aring;  world"

but I would like

"hello å  world"

Is there a simple way to achieve this? (I.e. simpler than converting &aring; back to å in the result.)
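To illustrate why a named entity such as &aring; is a problem for XML consumers, here is a minimal sketch using the JDK's built-in XML parser. The class name EntityCheck, the helper isWellFormed, and the <p> wrapper element are assumptions made for this illustration and are not part of the original question. XML predefines only a handful of entities (such as &amp; and &lt;), so &aring; is rejected unless a DTD declares it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for this illustration only.
class EntityCheck {

    // Returns true if the string parses as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Thrown for well-formedness errors such as an undeclared entity.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The HTML entity is not declared in plain XML, so parsing fails.
        System.out.println(isWellFormed("<p>hello &aring; world</p>"));
        // The literal character (written here as \u00e5) is fine.
        System.out.println(isWellFormed("<p>hello \u00e5 world</p>"));
    }
}
```

The JDK parser may also print a "[Fatal Error]" line to standard error for the first input; that is its default error handler, not an exception escaping the helper.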

bmoc · Accepted Answer

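You can configure Jsoup's escaping mode: using EscapeMode.xhtml restricts escaping to the few entities that XML itself defines (such as &lt;, &gt; and &amp;), so characters like å are written out as-is instead of as named HTML entities.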

Here's a complete snippet that accepts str as input, and cleans it using Whitelist.simpleText():

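import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

// Parse str into a Document
Document doc = Jsoup.parse(str);

// Clean the document.
doc = new Cleaner(Whitelist.simpleText()).clean(doc);

// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);

// Get back the string of the body.
str = doc.body().html();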
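As a possible shortcut, Jsoup.clean also has an overload that takes a Document.OutputSettings, which lets you set the escape mode without handling the Document yourself. The sketch below assumes a jsoup version that still ships the Whitelist class (newer releases rename it to Safelist) and uses the whitelist from the question. The class name CleanWithoutEntities and the empty base URI are assumptions made for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.safety.Whitelist;

// Hypothetical wrapper class, named for this illustration only.
class CleanWithoutEntities {

    // Cleans the input and escapes only the XML-safe entities.
    static String clean(String html) {
        Document.OutputSettings settings = new Document.OutputSettings()
                .escapeMode(Entities.EscapeMode.xhtml);
        // An empty base URI is assumed here; pass a real one if relative
        // links in the input need to be resolved.
        return Jsoup.clean(html, "", Whitelist.basicWithImages(), settings);
    }

    public static void main(String[] args) {
        // \u00e5 is the character å from the question.
        System.out.println(clean("hello \u00e5 <script></script> world"));
    }
}
```

This relies on the output charset being able to represent the character, which holds for jsoup's default of UTF-8.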

Tags:

java

html

html-entities

jsoup

aioobe
