Jsoup having problems with special HTML symbols, &lsquo; &mdash; etc

Question

I have some HTML (as a String) that I am putting through Jsoup just so I can add something to all href and src attributes, and that part works fine. However, I'm noticing that for some special HTML characters, Jsoup is converting them from, say, &ldquo; to the actual character “. I output the value before and after, and I see that change.

Before:

THIS &mdash; IS A &ldquo;TEST&rdquo;. 5 &gt; 4. trademark: &#153;

After:

THIS — IS A “TEST”. 5 &gt; 4. trademark: ?

What the heck is going on? I was specifically converting those special characters to their HTML entities before any Jsoup stuff to avoid this. The quotes changed to the actual quote characters, the greater-than stayed the same, and the trademark changed into a question mark. Aaaaaaa.

FYI, my Jsoup code is doing:

Document document = Jsoup.parse(fileHtmlStr);
//some stuff
String modifiedFileHtmlStr = document.html();

Thanks for any help!

Andrey Chaschev · Accepted Answer

The code below produces output similar to the input markup. It switches the escape mode to extended and sets the output charset to ASCII, so the TM sign gets escaped for systems which don't support Unicode.

The output:

<p>THIS &mdash; IS A &ldquo;TEST&rdquor;&period; 5 &gt; 4&period; trademark&colon; &#x99;</p>

The code:

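import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

Document doc = Jsoup.parse(
    "<p>THIS &mdash; IS A &ldquo;TEST&rdquo;. 5 &gt; 4. trademark: &#153;</p>");

Document.OutputSettings settings = doc.outputSettings();

// Keep the markup as-is instead of re-indenting it
settings.prettyPrint(false);
// Use the extended set of named entities (&mdash;, &ldquo;, ...) when escaping
settings.escapeMode(Entities.EscapeMode.extended);
// Escape every character that ASCII cannot represent
settings.charset("ASCII");

String modifiedFileHtmlStr = doc.html();

System.out.println(modifiedFileHtmlStr);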
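
To illustrate why setting the charset to ASCII helps, here is a rough stdlib-only sketch of the general idea: any character the target charset cannot encode is written as a numeric character reference instead of being passed through. This is not Jsoup's actual implementation. The class name EntityEscapeSketch and the helpers escapeUnencodable and lossyRoundTrip are made up for this example, and the sketch only handles unencodable characters (it does not escape <, > or & the way Jsoup does). The second helper shows one plausible way a question mark can end up in output: Java substitutes ? when a String is encoded to a charset that cannot represent a character.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EntityEscapeSketch {

    // Hypothetical helper: replaces every code point the target charset
    // cannot encode with a hexadecimal numeric character reference.
    static String escapeUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    // Hypothetical helper: encodes and decodes with the same charset.
    // String.getBytes substitutes the charset's replacement byte
    // ('?' for US-ASCII) for characters it cannot map.
    static String lossyRoundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "trademark: \u2122";
        // The TM sign is lost when forced into ASCII without escaping
        System.out.println(lossyRoundTrip(text, StandardCharsets.US_ASCII));
        // Escaping first keeps the information in pure ASCII
        System.out.println(escapeUnencodable(text, StandardCharsets.US_ASCII));
    }
}
```

With a charset such as UTF-8 that can encode every character, escapeUnencodable returns the text unchanged, which matches the behaviour in the question where the real characters were written out instead of entities.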

Tags:

java

html-entities

jsoup

Michael K
