I'm using Jsoup to remove all the images from an HTML page. I receive the page through an HTTP response, which also specifies the content charset.
The problem is that Jsoup unescapes some special characters.
For example, for the input:
<html><head></head><body><p>isn&rsquo;t</p></body></html>
After running
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
Document doc = Jsoup.parse(check);
System.out.println(doc.outerHtml());
I get:
<html><head></head><body><p>isn’t</p></body></html>
I want to avoid changing the html in any other way except for removing the images.
By using the command:
doc.outputSettings().prettyPrint(false).charset("ASCII").escapeMode(EscapeMode.extended);
I do get the correct output but I'm sure there are cases where that charset won't be good. I just want to use the charset specified in the HTTP header and I'm afraid this will change my document in ways I can't predict. Is there any other cleaner method for removing the images without changing anything else inadvertently?
Thank you!
Here is a workaround not involving any charset except the one specified in the HTTP header.
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>".replaceAll("&([^;]+?);", "**$1;");
Document doc = Jsoup.parse(check);
doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);
System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));
OUTPUT
<html><head></head><body><p>isn&rsquo;t</p></body></html>
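One weakness of the workaround above is that `**` can legitimately occur in a page, and `&([^;]+?);` also matches text that is not an entity reference. Below is a minimal sketch of a tighter variant. It uses only the JDK so it runs without Jsoup on the classpath. The class name `EntityMask`, the marker `@@ENT:` and the helper names are made up for illustration, and the marker is assumed never to appear in the real page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: hide entity references from an HTML parser, then restore them afterwards. */
class EntityMask {
    // Hypothetical ASCII-only marker; assumed not to occur in the page being processed.
    private static final String MARK = "@@ENT:";

    // Matches named (&rsquo;), decimal (&#8217;) and hexadecimal (&#x2019;) references only.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Matches what mask() produced: the marker, the reference body, then ';'.
    private static final Pattern MASKED =
            Pattern.compile(Pattern.quote(MARK) + "([#A-Za-z0-9]+);");

    /** Replaces the leading '&' of every entity reference with the marker. */
    static String mask(String html) {
        return ENTITY.matcher(html).replaceAll(Matcher.quoteReplacement(MARK) + "$1;");
    }

    /** Turns every masked reference back into its original '&...;' form. */
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String page = "<p>isn&rsquo;t &#8212; a & b; done</p>";
        String masked = mask(page);
        // Parsing with Jsoup and removing the images would happen here, on `masked`.
        System.out.println(masked);
        System.out.println(unmask(masked));
    }
}
```

In practice the `Jsoup.parse(...)` call and the image removal would sit between `mask` and `unmask`, exactly where the two `replaceAll` calls sit in the workaround.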
DISCUSSION
I wish there was a solution in Jsoup's API - @dlv
Using Jsoup's API would require you to write a custom NodeVisitor. That would lead to (re)inventing code that already exists inside Jsoup. The custom NodeVisitor would emit an HTML escape code instead of a Unicode character.
Another option would involve writing a custom character encoder. The default UTF-8 character encoder can encode ’. This is why Jsoup doesn't preserve the original escape sequence in the final HTML code.
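The encoder point can be checked with the JDK alone. This is only a sketch of the idea: it assumes Jsoup makes a comparable "can the output charset represent this character?" decision when serializing, and the class and method names here are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CanEncodeDemo {
    /** True when the charset can represent the character directly, so no escape is needed. */
    static boolean representable(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rsquo = '\u2019'; // right single quotation mark, i.e. &rsquo;
        System.out.println(representable(StandardCharsets.UTF_8, rsquo));      // true
        System.out.println(representable(StandardCharsets.US_ASCII, rsquo));   // false
        System.out.println(representable(StandardCharsets.ISO_8859_1, rsquo)); // false
    }
}
```

This matches what the question observed: with the ASCII charset the character cannot be written as-is, so an escape has to be emitted, while with UTF-8 the raw character is written.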
Either of the two options above represents a significant coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are generated in the final HTML code: hexadecimal escape (&#xAB;), decimal escape (&#8212;), the original escape sequence (&rsquo;), or the encoded character itself (which is the case in your post).
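To make the hexadecimal and decimal forms concrete, here is a small JDK-only sketch that builds both numeric character references from a code point. The class and method names are hypothetical, and this is not part of Jsoup.

```java
class NumericEscapes {
    /** Hexadecimal numeric character reference, e.g. &#x2019; */
    static String hex(int codePoint) {
        return String.format("&#x%X;", codePoint);
    }

    /** Decimal numeric character reference, e.g. &#8217; */
    static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    public static void main(String[] args) {
        int rsquo = "\u2019".codePointAt(0); // the character behind &rsquo;
        System.out.println(hex(rsquo));      // &#x2019;
        System.out.println(decimal(rsquo));  // &#8217;
    }
}
```

Neither form recovers the named sequence the page originally used. That would need either a name lookup table or the masking approach from the workaround.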