When outputting a string in HTML, one must escape special characters as HTML entities ("&<>" etc.) for understandable reasons.
I've examined two Java implementations of this: org.apache.commons.lang.StringEscapeUtils.escapeHtml(String) and net.htmlparser.jericho.CharacterReference.encode(CharSequence).
Both escape all characters above Unicode code point 127 (0x7F), which is effectively all non-English characters.
This behavior is fine, but the strings it produces are not human-readable when the characters are non-English (for example, Hebrew or Arabic). I've seen that when characters above Unicode 127 aren't escaped like this, they still render correctly in browsers. I believe this is because the HTML page is UTF-8 encoded, so the browser can decode these characters directly.
My question: Can I safely disable escaping Unicode characters above code point 127 when escaping HTML entities, provided my web page is UTF-8 encoded?
And what's the difference between escaping and encoding?
Encoding is transforming data from one format into another format. Escaping is a subset of encoding, where not all characters need to be encoded. Only some characters are encoded (by using an escape character).
Escaping in HTML means replacing certain special characters with character references. Usually you replace <, >, " and &, which have special meanings in HTML, with &lt;, &gt;, &quot; and &amp;. The browser then shows the text as written instead of interpreting it as markup.
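To make the distinction concrete, here is a minimal Java sketch. The class and method names (EncodingVsEscaping, escapeMarkup, encodeUtf8) are made up for illustration and don't come from any library. Escaping changes only the markup-significant characters and returns text; encoding converts every character into bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodingVsEscaping {

    // Escaping: only the characters with special meaning in HTML are replaced;
    // everything else (including non-ASCII text) is left untouched.
    static String escapeMarkup(String s) {
        return s.replace("&", "&amp;")   // must run first, or the entities below get double-escaped
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Encoding: every character is converted to its byte representation.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "<b>" + the Hebrew word "shalom" + "</b>", written with Unicode escapes
        // so the source file compiles regardless of its own encoding.
        String text = "<b>\u05E9\u05DC\u05D5\u05DD</b>";
        System.out.println(escapeMarkup(text));       // still a String, Hebrew letters unchanged
        System.out.println(encodeUtf8(text).length);  // 7 ASCII bytes + 4 letters * 2 bytes = 15
    }
}
```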
In Java, we can use StringEscapeUtils.escapeHtml4(str) from Apache commons-text to escape HTML characters. In the old days, we usually used the StringEscapeUtils class from Apache commons-lang3 to escape HTML, but that class is deprecated as of 3.6.
EDIT - The reason for escaping is that special characters like & and < can end up causing the browser to display something other than what you intended. A bare & is technically an error in the HTML. Most browsers try to deal intelligently with such errors and will display them correctly in most cases.
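You only need to use HTML entities under two circumstances:
1. To escape a character that has a special meaning in HTML (e.g. <)
2. To display a character that is not available in the document's encoding (e.g. the € symbol in an ISO-8859-1 document)
Given that UTF-8 can represent all Unicode characters, only the first case applies.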
When typing HTML manually, you may find it practical to insert an HTML entity now and then if your editor and/or keyboard won't let you type a certain character (it's easier to just type &copy; than to figure out how to type an actual ©), but when escaping text automatically you just make the page size grow ;-)
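The two circumstances above can be sketched in Java roughly as follows. This is not how commons-lang or Jericho do it; MinimalHtmlEscaper and its escape method are hypothetical names, and the only non-trivial JDK call is CharsetEncoder.canEncode. The idea: always escape the markup-significant characters, and fall back to a numeric character reference only when the page's charset cannot represent a character.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class MinimalHtmlEscaper {

    static String escape(String text, Charset pageCharset) {
        CharsetEncoder encoder = pageCharset.newEncoder();
        StringBuilder out = new StringBuilder(text.length() + 16);
        // Iterate by code point so characters outside the BMP stay intact.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                // Case 1: characters with a special meaning in HTML.
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:
                    String ch = new String(Character.toChars(cp));
                    if (encoder.canEncode(ch)) {
                        out.append(ch);          // representable in the page charset: keep as-is
                    } else {
                        // Case 2: not representable, so use a numeric character reference.
                        out.append("&#").append(cp).append(';');
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "5 < 6 \u20AC";  // \u20AC is the euro sign
        System.out.println(escape(text, StandardCharsets.UTF_8));       // euro sign kept
        System.out.println(escape(text, StandardCharsets.ISO_8859_1));  // euro sign becomes &#8364;
    }
}
```

With UTF-8 as the page charset the fallback branch is never taken for well-formed text, which is why escaping everything above code point 127 only makes the output larger.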
I know little about Java but other languages have different functions to encode special chars and all possible entities.
If you send the encoding in the Content-Type header:
Content-Type: text/html; charset=utf-8
then the browser will interpret your source as UTF-8 and you can send all those characters as normal UTF-8 encoded bytes.
Alternatively, you can specify the encoding in the header of your HTML page like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
This has the advantage that the information is stored with the HTML page if the user saves it and re-opens it from their hard disk at a later time.
Personally I'd do both (send the right header and add the meta tag to your HTML page). It should be fine as long as the two places agree about the encoding.
Update: HTML 5 has added a new syntax for specifying the encoding:
<meta charset="utf-8">
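As a rough Java illustration of doing both, here is a sketch that uses the HTTP server bundled with the JDK (com.sun.net.httpserver). The class name Utf8PageServer, the methods renderPage and start, and the page content are assumptions for the example. It sends the Content-Type header with charset=utf-8, puts the matching meta tag in the page, and writes the body as UTF-8 bytes so that all three agree.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class Utf8PageServer {

    // The charset announced in the HTTP header...
    static final String CONTENT_TYPE = "text/html; charset=utf-8";

    // ...must match the meta tag and the bytes actually written.
    // The caller is expected to pass text that is already HTML-escaped.
    static byte[] renderPage(String escapedBody) {
        String html = "<!DOCTYPE html>\n"
                + "<html><head><meta charset=\"utf-8\"><title>Demo</title></head>"
                + "<body><p>" + escapedBody + "</p></body></html>";
        return html.getBytes(StandardCharsets.UTF_8);
    }

    // Starts a small server on the given port and returns it so the caller can stop it.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            byte[] page = renderPage("\u05E9\u05DC\u05D5\u05DD"); // Hebrew text, sent unescaped
            exchange.getResponseHeaders().set("Content-Type", CONTENT_TYPE);
            exchange.sendResponseHeaders(200, page.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(page);
            }
        });
        server.start();
        return server;
    }
}
```

Calling Utf8PageServer.start(8080) and opening http://localhost:8080/ should show the Hebrew text directly, with no numeric character references in the page source.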