
When escaping a string with HTML entities, can I safely skip encoding chars above Unicode 127 if I use UTF-8?

When outputting a string in HTML, one must escape special characters as HTML entities ("&<>" etc.) for understandable reasons.

I've examined two Java implementations of this:

  • org.apache.commons.lang.StringEscapeUtils.escapeHtml(String)
  • net.htmlparser.jericho.CharacterReference.encode(CharSequence)

Both escape all characters above Unicode code point 127 (0x7F), which is effectively all non-English characters.

This behavior works, but the resulting markup is not human-readable when the text is non-English (for example, Hebrew or Arabic). I've seen that when characters above code point 127 aren't escaped like this, they still render correctly in browsers. I believe this is because the HTML page is UTF-8 encoded, so the browser can decode these characters directly.

My question: Can I safely disable escaping Unicode characters above code point 127 when escaping HTML entities, provided my web page is UTF-8 encoded?

Amos asked Feb 09 '11 09:02



2 Answers

You only need to use HTML entities under two circumstances:

  • To escape a character that has a special meaning in HTML (e.g. <)
  • To display a character that doesn't belong to the document encoding (e.g., the € symbol in an ISO-8859-1 document)

Given that UTF-8 can represent all Unicode characters, only the first case applies.
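The second case can be checked mechanically. The following is only a sketch using the JDK's `CharsetEncoder.canEncode`; the class and method names (`EncodableCheck`, `fitsInCharset`) are made up for illustration. It shows that the euro sign cannot be represented in ISO-8859-1 but can in UTF-8, which is why a UTF-8 page never needs an entity for it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodableCheck {
    // Hypothetical helper: true if every character of text can be
    // represented in the given charset without substitution.
    static boolean fitsInCharset(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";                 // the euro sign
        String hebrew = "\u05E9\u05DC\u05D5\u05DD"; // a Hebrew word

        // Not representable in ISO-8859-1, so an entity would be needed there.
        System.out.println(fitsInCharset(euro, StandardCharsets.ISO_8859_1));
        // Representable in UTF-8, so no entity is needed.
        System.out.println(fitsInCharset(euro, StandardCharsets.UTF_8));
        System.out.println(fitsInCharset(hebrew, StandardCharsets.UTF_8));
    }
}
```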

When typing HTML manually, you may find it practical to insert an HTML entity now and then if your editor or keyboard won't let you type a certain character (it's easier to type &copy; than to figure out how to type an actual ©), but when escaping text automatically you just make the page size grow ;-)

I know little about Java, but other languages provide separate functions: one that escapes only the special characters, and one that encodes every character that has an entity.
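For Java, the "special characters only" variant is small enough to sketch by hand. This is an illustrative helper, not the API of either library mentioned in the question; the class name `MinimalHtmlEscaper` is an assumption. It escapes `&`, `<`, `>` and both quote characters, and passes everything else through unchanged, including characters above code point 127.

```java
class MinimalHtmlEscaper {
    // Hypothetical helper: escapes only the characters that are special in
    // HTML text and attribute values. Non-ASCII characters are left as-is,
    // on the assumption that the page is served as UTF-8.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);        // includes all non-ASCII chars
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hebrew text written with Unicode escapes so the source stays ASCII.
        String raw = "<b>\u05E9\u05DC\u05D5\u05DD & \"hi\"</b>";
        String escaped = escape(raw);
        // The Hebrew letters survive unchanged; only the markup is escaped.
        System.out.println(escaped.contains("\u05E9\u05DC\u05D5\u05DD"));
        System.out.println(escaped.startsWith("&lt;b&gt;"));
    }
}
```

The output is only safe to emit if the response is actually declared and sent as UTF-8; otherwise the unescaped characters will be misread.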

Álvaro González answered Sep 28 '22 01:09


If you send the encoding in the Content-Type HTTP header:

Content-Type: text/html; charset=utf-8

then the browser will interpret your source as UTF-8 and you can send all those characters as normal UTF-8 encoded bytes.
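To see why the declared charset matters, here is a small sketch (the class name `CharsetMismatchDemo` and the helper `decodeAs` are hypothetical) that encodes a Hebrew word as UTF-8 bytes and then decodes those same bytes two ways. It stands in for what a browser does with the response body; it is not a model of any particular browser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Hypothetical helper: decode raw bytes the way a receiver would if it
    // believed the content was in the given charset.
    static String decodeAs(byte[] bytes, Charset assumed) {
        return new String(bytes, assumed);
    }

    public static void main(String[] args) {
        String original = "\u05E9\u05DC\u05D5\u05DD"; // four Hebrew letters
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        // Each of these letters takes two bytes in UTF-8.
        System.out.println(wire.length);

        // Decoded with the right charset, the text round-trips.
        System.out.println(decodeAs(wire, StandardCharsets.UTF_8).equals(original));

        // Decoded as ISO-8859-1, every byte becomes its own character,
        // so the result is garbage of a different length.
        String wrong = decodeAs(wire, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(original));
        System.out.println(wrong.length());
    }
}
```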

Alternatively, you can specify the encoding in the header of your HTML page like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

This has the advantage that the information is stored with the HTML page if the user saves it and re-opens it from their hard disk at a later time.

Personally I'd do both (send the right header and add the meta-tag to your HTML page). It should be fine as long as the two places agree about the encoding.
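If you want to verify that the two declarations agree, a rough check could look like the sketch below. Everything here (`CharsetAgreement`, `declaredCharset`, `agree`, and the regular expression) is an assumption made for illustration; it does a simple text match and is not a full HTTP header or HTML parser.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetAgreement {
    // Matches "charset=NAME" with an optional opening quote, case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: extract the charset named in a header value or a
    // meta tag. Returns empty if none is declared or the name is unknown.
    static Optional<Charset> declaredCharset(String text) {
        Matcher m = CHARSET.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) { // illegal or unsupported name
            return Optional.empty();
        }
    }

    // True only if both places declare a charset and it is the same one.
    static boolean agree(String contentTypeHeader, String metaTag) {
        Optional<Charset> header = declaredCharset(contentTypeHeader);
        return header.isPresent() && header.equals(declaredCharset(metaTag));
    }

    public static void main(String[] args) {
        String meta =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">";
        System.out.println(agree("text/html; charset=utf-8", meta));
        System.out.println(agree("text/html; charset=iso-8859-1", meta));
    }
}
```

Comparing `Charset` objects rather than raw strings means differences in letter case or alias (for example `utf-8` versus `UTF-8`) do not count as a mismatch.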

Update: HTML 5 has added a new syntax for specifying the encoding:

<meta charset="utf-8">

Joachim Sauer answered Sep 28 '22 01:09