How to remove HTML Entities using Jsoup? If I use Element.toString(), I get:
(...)
<td>Letter &oacute;</td> //valid: <td>Letter ó</td>
(...)
Cleaner.clean: Creates a new, clean document from the original dirty document, containing only elements allowed by the safelist. The original document is not modified. Only elements from the dirty document's body are used.
Deprecated. As of release v1.14.1, this class (Whitelist) is deprecated in favour of Safelist.
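To make the cleaning behaviour described above concrete, here is a minimal sketch, assuming jsoup 1.14.1 or later is on the classpath. The class name CleanerDemo, the helper cleanBody, and the sample markup are made up for illustration; Safelist.basic() is just one of the predefined safelists.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

public class CleanerDemo {

    // Hypothetical helper: parses the dirty HTML, builds a cleaned copy that
    // keeps only elements allowed by Safelist.basic(), and returns its body HTML.
    static String cleanBody(String dirtyHtml) {
        Document dirty = Jsoup.parse(dirtyHtml);
        // clean() returns a new Document; the dirty one is left untouched.
        Document clean = new Cleaner(Safelist.basic()).clean(dirty);
        return clean.body().html();
    }

    public static void main(String[] args) {
        String dirtyHtml = "<p>Hello <b>world</b></p><script>alert('x')</script>";
        // The <script> element is not in the basic safelist, so it is dropped.
        System.out.println(cleanBody(dirtyHtml));
    }
}
```

If you only have a body fragment as a string, Jsoup.clean(html, Safelist.basic()) should give a similar result in one call.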
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
This may be off-topic to the context of your question, but if you just want to decode HTML entities without any other changes to the string (no tag processing, no comment stripping, etc.) you can use org.jsoup.parser.Parser.unescapeEntities, e.g.:
assert Parser.unescapeEntities("x &asymp; <i>y</i>\n", true)
        .equals("x ≈ <i>y</i>\n");
I believe you can specify the output encoding and escape mode on a Jsoup Document after you create it, something like this:
Document newDocument = Jsoup.parse(htmlString, StringUtils.EMPTY, Parser.htmlParser());
newDocument.outputSettings().escapeMode(EscapeMode.base);
newDocument.outputSettings().charset(CharEncoding.UTF_8);
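A self-contained variant of the same idea, as a rough sketch rather than a definitive answer: it assumes a reasonably recent jsoup (1.11.1 or later, for selectFirst) and uses plain charset names instead of the Apache Commons constants. The class name EntityOutputDemo and the helpers renderCell and cellText are hypothetical. The point is that the output charset decides whether ó is written literally or escaped as an entity, while text() always returns the decoded characters.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities;

public class EntityOutputDemo {

    // Hypothetical helper: serializes the first <td> using the given output charset.
    static String renderCell(String html, String charsetName) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings()
           .escapeMode(Entities.EscapeMode.base)
           .charset(charsetName);
        Element cell = doc.selectFirst("td");
        return cell.outerHtml();
    }

    // Hypothetical helper: returns the decoded text content of the first <td>.
    static String cellText(String html) {
        return Jsoup.parse(html).selectFirst("td").text();
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Letter &oacute;</td></tr></table>";
        // UTF-8 can encode ó, so it is written as a literal character.
        System.out.println(renderCell(html, "UTF-8"));
        // US-ASCII cannot encode ó, so it is written as an entity instead.
        System.out.println(renderCell(html, "US-ASCII"));
        // text() gives the decoded characters regardless of output settings.
        System.out.println(cellText(html));
    }
}
```

So if Element.toString() still shows entities, it is worth checking which charset the document's output settings ended up with.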