I was trying to use this code to strip all HTML elements from my text:
Jsoup.clean(preparedText, Whitelist.none())
Unfortunately, it didn't remove the &nbsp; entities. I thought it would replace them with whitespace, the same way it replaces &middot; with a middle dot ("·").
Should I use another method in order to achieve this functionality?
clean. Creates a new, clean document, from the original dirty document, containing only elements allowed by the safelist. The original document is not modified. Only elements from the dirty document's body are used.
Document docsoup = Jsoup.parse(htmlin); docsoup.head().remove();
From the Jsoup docs:
Whitelists define what HTML (elements and attributes) to allow through the cleaner. Everything else is removed.
So whitelists are concerned only with tags and attributes.
&nbsp; is neither a tag nor an attribute. It is simply the HTML encoding of a special character (the non-breaking space). If you want to translate the encoding into normal text, you can use, for example, the Apache Commons Lang library or Jsoup's Parser.unescapeEntities method:
System.out.println(Parser.unescapeEntities(doc.toString(), false));
Addendum:
The translation from &middot; to "·" already happens when you parse the HTML. It does not seem to be related to the clean method.
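If the goal is simply to end up with ordinary spaces, one option is to post-process the string after cleaning. The following is only a minimal sketch using the Java standard library, not part of Jsoup: the class name NbspCleaner and the method toPlainSpaces are hypothetical, and it assumes the cleaned text may contain either the literal entity &nbsp; or the already-decoded non-breaking space character (U+00A0).

```java
// Hypothetical helper (not a Jsoup API): normalizes non-breaking spaces
// in already-cleaned text to ordinary ASCII spaces.
final class NbspCleaner {

    // The decoded form of the &nbsp; entity.
    private static final char NBSP = '\u00A0';

    static String toPlainSpaces(String text) {
        return text
                .replace("&nbsp;", " ")  // entity that was left escaped in the output
                .replace(NBSP, ' ');     // character that was already decoded
    }

    public static void main(String[] args) {
        // Assumed sample input: one escaped entity and one decoded character.
        String cleaned = "foo&nbsp;bar\u00A0baz";
        System.out.println(toPlainSpaces(cleaned));
    }
}
```

You would call this on the result of your cleaning step, for example on the string returned by Jsoup.clean(...). Handling both forms keeps the helper independent of whether the entity was decoded before the text reached it.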