When escaping a string with HTML entities, can I safely skip encoding chars above Unicode 127 if I use UTF-8?

When outputting a string in HTML, one must escape special characters as HTML entities ("&<>" etc.) for understandable reasons.

I've examined two Java implementations of this:

org.apache.commons.lang.StringEscapeUtils.escapeHtml(String)
net.htmlparser.jericho.CharacterReference.encode(CharSequence)

Both escape all characters above Unicode code point 127 (0x7F), which is effectively all non-English characters.

This behavior is fine, but the strings it produces are not human-readable when the text is non-English (for example, Hebrew or Arabic). I've seen that when characters above Unicode 127 aren't escaped like this, they still render correctly in browsers. I believe this is because the HTML page is UTF-8 encoded, so the browser can decode these characters directly.

My question: Can I safely disable escaping Unicode characters above code point 127 when escaping HTML entities, provided my web page is UTF-8 encoded?

asked Feb 09 '11 09:02

Amos

2 Answers

You only need to use HTML entities under two circumstances:

To escape a character that has a special meaning in HTML (e.g. <)
To display a character that doesn't belong to the document encoding (e.g., the € symbol in an ISO-8859-1 document)

Given that UTF-8 can represent all Unicode characters, only the first case applies.
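As a rough illustration of the two cases, here is a minimal Java sketch. The class and method names (HtmlEscaper, escape) are hypothetical and are not part of either library mentioned in the question. It escapes the characters with special meaning in HTML, and falls back to a numeric character reference only when the page's charset cannot represent a character:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper for illustration only; not a library API.
final class HtmlEscaper {
    private HtmlEscaper() {}

    static String escape(String text, Charset pageCharset) {
        CharsetEncoder encoder = pageCharset.newEncoder();
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                // Case 1: characters with a special meaning in HTML.
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:
                    String s = new String(Character.toChars(cp));
                    if (encoder.canEncode(s)) {
                        // The page charset can represent it: emit as-is.
                        out.append(s);
                    } else {
                        // Case 2: not representable in the page charset,
                        // so use a numeric character reference instead.
                        out.append("&#").append(cp).append(';');
                    }
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, calling escape with StandardCharsets.UTF_8 leaves Hebrew, Arabic, or the € sign untouched, while passing StandardCharsets.ISO_8859_1 turns € into &#8364;. With UTF-8 the second branch is only reached for malformed input such as a lone surrogate, which a production implementation would want to handle separately.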

When typing HTML manually you may find it practical to insert an HTML entity now and then if your editor and/or keyboard won't let you type a certain character (it's easier to just type &copy; than to figure out how to type an actual ©), but when escaping text automatically you just make the page size grow ;-)

I know little about Java, but other languages have separate functions for the two jobs: one that encodes only the special chars and one that encodes all possible entities.

answered Sep 28 '22 01:09

Álvaro González

If you send the encoding in the Content-Type HTTP header:

Content-Type: text/html; charset=utf-8

then the browser will interpret your source as UTF-8 and you can send all those characters as normal UTF-8 encoded bytes.

Alternatively, you can specify the encoding in the header of your HTML page like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

This has the advantage that the information is stored with the HTML page if the user saves it and re-opens it from their hard disk at a later time.

Personally I'd do both (send the right header and add the meta-tag to your HTML page). It should be fine as long as the two places agree about the encoding.
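To see why the declared encoding has to match the bytes actually sent, here is a small Java sketch. The class and method names are made up for illustration. It encodes text as UTF-8, the way the server would, and then decodes those bytes with whatever charset the browser was told to use:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only simulates what a browser does
// with the bytes it receives.
final class CharsetMismatchDemo {
    private CharsetMismatchDemo() {}

    // Encode the text as UTF-8 (what the server sends), then decode
    // the bytes using the charset that was declared to the browser.
    static String decodeAs(String text, Charset declared) {
        byte[] sentBytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(sentBytes, declared);
    }
}
```

If the declared charset is UTF-8, decodeAs returns the original text. If the page is declared as ISO-8859-1 while the bytes are UTF-8, a character such as é comes back as the two characters Ã©, which is the usual symptom of a mismatched header or meta tag.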

Update: HTML 5 has added a new syntax for specifying the encoding:

<meta charset="utf-8">

answered Sep 28 '22 01:09

Joachim Sauer

Tags: java, html, escaping, encoding, html-entities
