
Jsoup unescapes special characters

I'm using Jsoup to remove all the images from an HTML page. I'm receiving the page through an HTTP response - which also contains the content charset.

The problem is that Jsoup unescapes some special characters.

For example, for the input:

<html><head></head><body><p>isn&rsquo;t</p></body></html>

After running

String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
Document doc = Jsoup.parse(check);
System.out.println(doc.outerHtml());

I get:

<html><head></head><body><p>isn’t</p></body></html><p></p>

I want to avoid changing the html in any other way except for removing the images.

By using the command:

doc.outputSettings().prettyPrint(false).charset("ASCII").escapeMode(EscapeMode.extended);

I do get the correct output but I'm sure there are cases where that charset won't be good. I just want to use the charset specified in the HTTP header and I'm afraid this will change my document in ways I can't predict. Is there any other cleaner method for removing the images without changing anything else inadvertently?

Thank you!

dlvhdr asked Dec 19 '15 08:12


1 Answer

Here is a workaround not involving any charset except the one specified in the HTTP header.

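// Mask every escape sequence before parsing so Jsoup never decodes it: &rsquo; becomes **rsquo;
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>".replaceAll("&([^;]+?);", "**$1;");

Document doc = Jsoup.parse(check);

// EscapeMode here is org.jsoup.nodes.Entities.EscapeMode
doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);

// Restore the original escape sequences in the serialized output
System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));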

OUTPUT

<html><head></head><body><p>isn&rsquo;t</p></body></html>
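
The masking step above can be tightened a little. The following is only a sketch using plain JDK classes: the class name EntityMask, the placeholder string and the stricter pattern are my own assumptions, not part of Jsoup. It only masks text that looks like a character reference (named, decimal or hexadecimal) and uses a longer placeholder than "**", which should lower the chance of colliding with text already present in the page. The Jsoup parse and image removal would go between mask() and unmask().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jsoup): hides character references from the
// parser and restores them afterwards.
final class EntityMask {
    // Assumption: this placeholder never occurs in the real input.
    private static final String MARK = "__JSOUP_ENT__";

    // Body of a named (rsquo), decimal (#8217) or hexadecimal (#x2019) reference.
    private static final String REF_BODY = "(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*)";

    private static final Pattern REFERENCE = Pattern.compile("&" + REF_BODY + ";");
    private static final Pattern MASKED = Pattern.compile(Pattern.quote(MARK) + REF_BODY + ";");

    // &rsquo; becomes __JSOUP_ENT__rsquo;
    static String mask(String html) {
        return REFERENCE.matcher(html).replaceAll(Matcher.quoteReplacement(MARK) + "$1;");
    }

    // __JSOUP_ENT__rsquo; becomes &rsquo; again
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
        String masked = mask(check);
        // Parse 'masked' with Jsoup and remove the images here, then unmask the output.
        System.out.println(masked);
        System.out.println(unmask(masked));
    }
}
```

Like the original workaround, this relies on the placeholder passing through the parser unchanged, so it is worth checking against a few real pages before trusting it.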

DISCUSSION

I wish there was a solution in Jsoup's API - @dlv

Using Jsoup's API would require you to write a custom NodeVisitor. It would lead to (re)inventing some code that already exists inside Jsoup. The custom NodeVisitor would write an HTML escape sequence back out instead of the Unicode character.

Another option would involve writing a custom character encoder. The default UTF-8 encoder can encode the character that &rsquo; stands for (’, U+2019), so Jsoup writes the character itself rather than an escape sequence. This is why Jsoup doesn't preserve the original escape sequence in the final HTML code.
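
To illustrate the encoder point with the JDK alone (no Jsoup involved), here is a small sketch. The class name CanEncodeDemo is made up for the example, and treating this check as the one that drives Jsoup's escaping is an assumption based on the explanation above. It asks a few charsets whether they can represent U+2019, which also suggests why switching the output charset to ASCII forces an escape sequence.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: asks a charset whether it can represent a character.
final class CanEncodeDemo {
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rsquo = '\u2019'; // the character that &rsquo; stands for

        // UTF-8 can represent it, so a serializer has no need to escape it.
        System.out.println("UTF-8:      " + canEncode(StandardCharsets.UTF_8, rsquo));      // true
        // US-ASCII and ISO-8859-1 cannot, so an escape sequence is the only option.
        System.out.println("US-ASCII:   " + canEncode(StandardCharsets.US_ASCII, rsquo));   // false
        System.out.println("ISO-8859-1: " + canEncode(StandardCharsets.ISO_8859_1, rsquo)); // false
    }
}
```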

Either of the two options above represents a big coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are generated in the final HTML code: hexadecimal escape (&#xAB;), decimal escape (&#151;), the original escape sequence (&rsquo;), or the encoded character itself (which is the case in your post).
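
For reference, the numeric forms mentioned above are easy to produce by hand. This is a plain-JDK sketch with a made-up class name (EscapeForms), not a Jsoup feature. The named form (&rsquo;) is left out because it needs an entity lookup table.

```java
// Hypothetical helper: renders one code point in the different output forms.
final class EscapeForms {
    // Hexadecimal character reference, e.g. &#x2019;
    static String hex(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
    }

    // Decimal character reference, e.g. &#8217;
    static String dec(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // The character itself, which is what ends up in the output in the question
    static String literal(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int rsquo = 0x2019; // code point behind &rsquo;
        System.out.println(hex(rsquo));     // &#x2019;
        System.out.println(dec(rsquo));     // &#8217;
        System.out.println(literal(rsquo)); // the character itself
    }
}
```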

Stephan answered Oct 07 '22 15:10