
How to remove HTML Entities in Jsoup?

Tags: java, html, jsoup

How can I remove (decode) HTML entities using Jsoup? When I call Element.toString(), I get:

(...)
       <td>Letter &oacute;</td> // expected: <td>Letter ó</td>
(...)
barwnikk asked Nov 13 '13 at 20:11



2 Answers

This may be off-topic for your question, but if you just want to decode HTML entities without any other changes to the string (no tag processing, no comment stripping, etc.), you can use org.jsoup.parser.Parser.unescapeEntities, e.g.:

assert Parser.unescapeEntities("x &asymp; <i>y</i>\n", true)
    .equals("x ≈ <i>y</i>\n");
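To make "decode entities and touch nothing else" concrete, here is a rough stdlib-only sketch of the idea. It is not how jsoup implements unescapeEntities: the class name EntitySketch, the decode method, and the tiny NAMED table are all hypothetical and only cover a handful of entities, whereas jsoup knows the full HTML set. It assumes Java 9 or later (Map.of, Matcher.appendReplacement with a StringBuilder).

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntitySketch {
    // Hypothetical, deliberately tiny table; the real HTML entity table is far larger.
    private static final Map<String, String> NAMED = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
        "oacute", "\u00F3",   // ó
        "asymp", "\u2248");   // ≈

    // Matches &#xHEX; then &#DECIMAL; then &name; (a trailing semicolon is required).
    private static final Pattern ENTITY =
        Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    // Converts a numeric reference to text, or returns the fallback if it is not a valid code point.
    private static String codePoint(String digits, int radix, String fallback) {
        try {
            int cp = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(cp) ? new String(Character.toChars(cp)) : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    // Replaces recognised entities; tags and everything else pass through unchanged.
    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = codePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = codePoint(body.substring(1), 10, m.group());
            } else {
                replacement = NAMED.getOrDefault(body, m.group()); // unknown names stay as-is
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("<td>Letter &oacute;</td>"));
    }
}
```

In real code you would call Parser.unescapeEntities as shown above rather than maintain a table like this by hand.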
Sasha answered Sep 27 '22 at 23:09


I believe you can set the escape mode and output encoding through the output settings of a Jsoup Document, something like this:

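// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Entities.EscapeMode, org.jsoup.parser.Parser
Document newDocument = Jsoup.parse(htmlString, "", Parser.htmlParser()); // "" = empty base URI
newDocument.outputSettings().escapeMode(EscapeMode.base);
newDocument.outputSettings().charset("UTF-8"); // output charset used when serializing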
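As far as I understand the jsoup documentation, the output charset controls which characters are escaped when HTML is generated: a character the charset cannot encode has to be written as an entity. The stdlib-only sketch below illustrates that idea and is not jsoup's code; the class CharsetEscapeSketch and the method escapeUnencodable are hypothetical names, and it always falls back to hexadecimal numeric entities.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CharsetEscapeSketch {
    // Keeps every character the target charset can encode; escapes the rest as &#xHEX;.
    static String escapeUnencodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // UTF-8 can encode ó (U+00F3), so the text is returned unchanged.
        System.out.println(escapeUnencodable("Letter \u00F3", "UTF-8"));
        // US-ASCII cannot, so the method returns "Letter &#xf3;".
        System.out.println(escapeUnencodable("Letter \u00F3", "US-ASCII"));
    }
}
```

This is why choosing UTF-8 as the output charset tends to leave letters like ó intact, while a narrower charset forces them back into entities.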
Алексей answered Sep 27 '22 at 23:09