
Jsoup clean method leaves &nbsp; elements

Tags:

java

html

jsoup

I was trying to use this code to strip all HTML elements from my text:

Jsoup.clean(preparedText, Whitelist.none())

Unfortunately, it didn't remove the &nbsp; entities. I thought it would replace them with whitespace, the same way it replaces &middot; with a middle dot ("·").

Should I use another method in order to achieve this functionality?

Ziv Gabovitch asked Jan 19 '16 09:01




1 Answer

From the Jsoup docs:

Whitelists define what HTML (elements and attributes) to allow through the cleaner. Everything else is removed.

So the whitelist is concerned only with tags and attributes. &nbsp; is neither a tag nor an attribute; it is simply the HTML entity for a special character (the non-breaking space). If you want to translate entities into normal text, you can use, for example, the excellent Apache Commons Lang library or Jsoup's Parser.unescapeEntities method:

System.out.println(Parser.unescapeEntities(doc.toString(), false));

Addendum:

The translation from &middot; to "·" already happens when the HTML is parsed. It does not seem to be related to the clean method.
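
Note that unescaping &nbsp; yields the non-breaking space character (U+00A0), which is not the same as an ordinary space. If a plain space is what you are after, one option is a small post-processing step on the cleaned string. The following is only a sketch using the standard library: the class NbspNormalizer and the method normalizeNbsp are hypothetical names, not part of Jsoup, and the input string merely stands in for whatever Jsoup.clean(...) returns in your case.

```java
public class NbspNormalizer {

    // Hypothetical helper (not a Jsoup API): replaces both the literal
    // "&nbsp;" entity and the already-decoded U+00A0 character with an
    // ordinary space.
    public static String normalizeNbsp(String text) {
        return text.replace("&nbsp;", " ").replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        // Assumed stand-in for the string returned by the cleaning step.
        String cleaned = "Hello&nbsp;world\u00A0again";
        System.out.println(normalizeNbsp(cleaned)); // prints "Hello world again"
    }
}
```

This only handles the non-breaking space; other entities would still need Parser.unescapeEntities or a similar unescaping step.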

luksch answered Sep 20 '22 08:09