
Jsoup having problems with special HTML symbols, &lsquo; &mdash; etc

I have some HTML (as a String) that I am putting through Jsoup just so I can add something to all href and src attributes, and that works fine. However, I'm noticing that for some special HTML characters, Jsoup converts them from, say, &ldquo; to the actual character “. I output the value before and after and I see that change.

Before:

THIS &mdash; IS A &ldquo;TEST&rdquo;. 5 &gt; 4. trademark: &#153;

After:

THIS — IS A “TEST”. 5 &gt; 4. trademark: ?

What the heck is going on? I was specifically converting those special characters to their HTML entities before any Jsoup stuff to avoid this. The quotes changed to the actual quote characters, the greater-than stayed the same, and the trademark changed into a question mark. Aaaaaaa.

FYI, my Jsoup code is doing:

Document document = Jsoup.parse(fileHtmlStr);
//some stuff
String modifiedFileHtmlStr = document.html();

Thanks for any help!

Michael K asked Sep 20 '13 14:09


1 Answer

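The code below produces output similar to the input markup. It switches the escape mode to extended, so that these characters are written as named entities, and sets the output charset to ASCII, so that characters outside ASCII (such as the TM sign) are escaped for systems which don't support Unicode.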

The output:

<p>THIS &mdash; IS A &ldquo;TEST&rdquor;&period; 5 &gt; 4&period; trademark&colon; &#x99;</p>

The code:

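import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

Document doc = Jsoup.parse(
    "<p>THIS &mdash; IS A &ldquo;TEST&rdquo;. 5 &gt; 4. trademark: &#153;</p>");

Document.OutputSettings settings = doc.outputSettings();

// Keep the markup on one line instead of re-indenting it.
settings.prettyPrint(false);
// Use the extended set of named entities when escaping.
settings.escapeMode(Entities.EscapeMode.extended);
// Escape every character that ASCII cannot represent.
settings.charset("ASCII");

String modifiedFileHtmlStr = doc.html();

System.out.println(modifiedFileHtmlStr);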
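
To illustrate why the ASCII charset setting matters, here is a minimal sketch that uses only the Java standard library, not Jsoup. It shows one plausible way a character such as ™ ends up as a question mark (encoding into a charset that cannot represent it), and how escaping it as a numeric entity first avoids that. The class name EntityEscapeSketch and the method escapeUnencodable are hypothetical names made up for this example. Jsoup's output charset appears to play a similar role, but this is not Jsoup's actual implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Jsoup.
class EntityEscapeSketch {

    // Replaces every code point the target charset cannot encode with a
    // hexadecimal numeric character reference such as &#x2122;.
    static String escapeUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "trademark: \u2122";

        // Encoding straight to ASCII silently substitutes '?' for the TM sign.
        byte[] lossy = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(lossy, StandardCharsets.US_ASCII));

        // Escaping first keeps the character recoverable by an HTML parser.
        System.out.println(escapeUnencodable(text, StandardCharsets.US_ASCII));
    }
}
```

Characters that ASCII can represent, such as the > in "5 > 4", pass through this helper unchanged; it only deals with encodability and does not escape HTML-significant characters.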
Andrey Chaschev answered Oct 16 '22 07:10