
What could go wrong in switching HTML encoding from UTF-8 to UTF-16?

What are the implications of a change from UTF-8 to UTF-16 for HTML encoding? I would like to know your thoughts on the issue. Are there things I need to think of before making such a change?

Note: I am interested because of the enormous amounts of Japanese and Chinese text I need to handle.

Newbie asked May 14 '09 19:05



5 Answers

I can think of a few things that will go wrong:

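  1. You MUST declare that the page is UTF-16, typically via the charset parameter of the HTTP Content-Type header. Unlike UTF-8, UTF-16 is not ASCII-compatible, so a client cannot read an encoding declaration inside the document until it already knows the encoding. Everything needs to be in UTF-16 from the first byte.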
  2. Older clients don't support UTF-16. For example, anything on Windows 9x. Possibly Mac OS 9 as well.
  3. Oh, wait, I almost forgot: North American and European copies of Windows XP don't have Asian fonts installed by default.
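
The first point can be illustrated with a minimal Python sketch. The page fragment and the header value below are made up for illustration; the sketch only shows why ASCII markup is no longer recognizable once the body is UTF-16, and what an explicit declaration in the Content-Type header might look like.

```python
import codecs

# Hypothetical page fragment, used only for illustration.
html = '<p>日本語</p>'

utf8_body = html.encode('utf-8')
# Little-endian UTF-16 with an explicit byte order mark in front.
utf16_body = codecs.BOM_UTF16_LE + html.encode('utf-16-le')

# In UTF-8 the ASCII markup is byte-for-byte identical to plain ASCII.
markup_visible_in_utf8 = b'<p>' in utf8_body

# In UTF-16 every ASCII character is followed by a zero byte, so a
# client scanning for ASCII markup (such as a charset declaration in
# the document itself) will not find it.
markup_visible_in_utf16 = b'<p>' in utf16_body

# The encoding therefore has to be stated outside the document body,
# for example in the HTTP Content-Type header (assumed value).
content_type = 'text/html; charset=utf-16'

print(markup_visible_in_utf8, markup_visible_in_utf16)
print(content_type)
```

A client that is told the encoding up front can decode the body normally; one that has to guess has nothing ASCII-readable to go on.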

Powerlord answered Oct 01 '22 09:10


  • Your bandwidth consumption is likely to nearly double, assuming most of your HTML is ASCII
  • Clients which incorrectly assume UTF-8 (or ASCII) will be confused

Why do you want to change to UTF-16?
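
A rough way to check the bandwidth point is to compare encoded sizes directly. This is a sketch with made-up sample strings, not a measurement of any real page: ASCII markup doubles in UTF-16, while BMP Japanese or Chinese text shrinks from three bytes per character to two, so the net effect depends on the markup-to-text ratio.

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode('utf-8')), len(text.encode('utf-16-le'))

# Hypothetical samples: pure ASCII markup and pure Japanese text.
markup = '<div class="article"><p></p></div>'
cjk = '日本語のテキスト'  # 8 characters, all in the Basic Multilingual Plane

markup_utf8, markup_utf16 = encoded_sizes(markup)
cjk_utf8, cjk_utf16 = encoded_sizes(cjk)

print('markup:', markup_utf8, 'bytes in UTF-8,', markup_utf16, 'in UTF-16')
print('cjk:   ', cjk_utf8, 'bytes in UTF-8,', cjk_utf16, 'in UTF-16')

# A client that wrongly assumes UTF-8 sees a NUL after every ASCII character.
misread = markup.encode('utf-16-le').decode('utf-8')
print(repr(misread[:6]))
```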

Jon Skeet answered Oct 02 '22 09:10


There is also the byte order, which becomes an issue with any encoding whose code units are wider than 8 bits. UTF-16 encoded files usually begin with a byte order mark (BOM), which is used to determine the byte order, or endianness, of that file.

Wikipedia has a quite good explanation of this.
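
The byte order issue is easy to see in a few lines of Python. The helper name `sniff_utf16_byte_order` is hypothetical and deliberately simplistic; it only checks the two UTF-16 byte order marks defined in the standard `codecs` module.

```python
import codecs

text = 'A'  # U+0041

# The same character, serialized in the two possible byte orders.
little = text.encode('utf-16-le')  # low byte first
big = text.encode('utf-16-be')     # high byte first

def sniff_utf16_byte_order(data):
    """Guess the byte order from a leading BOM; None if there is no BOM.

    Simplification: the UTF-16 LE BOM is also a prefix of the UTF-32 LE
    BOM, which this sketch does not try to tell apart.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return 'little'
    if data.startswith(codecs.BOM_UTF16_BE):
        return 'big'
    return None

print(little, big)
print(sniff_utf16_byte_order(codecs.BOM_UTF16_BE + big))
print(sniff_utf16_byte_order(little))
```

Without the mark, a reader has no reliable way to tell the two serializations apart and must be told the byte order some other way.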

FeatureCreep answered Oct 04 '22 09:10


As far as I know, all modern browsers support UTF-16 encoding. But as others have pointed out, you should declare the encoding explicitly. Not all browsers and platforms will support all Unicode characters, but I think this is true regardless of which encoding you use.

However, if bandwidth is a big issue, you should probably consider gzipping the HTML. This will save much more bandwidth than switching encoding.
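
To get a feel for the gzip suggestion, the sketch below compresses the same made-up page in both encodings using Python's standard `gzip` module. The sample is artificially repetitive, so the ratios it prints are far better than a real page would see; it is only meant to show how to run the comparison on your own content.

```python
import gzip

# Hypothetical, highly repetitive sample page: ASCII markup around Japanese text.
sample = ('<p>' + '日本語のテキストです。' + '</p>\n') * 200

sizes = {}
for encoding in ('utf-8', 'utf-16-le'):
    raw = sample.encode(encoding)
    compressed = gzip.compress(raw)
    # Store (uncompressed size, gzipped size) in bytes for each encoding.
    sizes[encoding] = (len(raw), len(compressed))

for encoding, (raw_size, gz_size) in sizes.items():
    print(encoding, raw_size, 'bytes raw,', gz_size, 'bytes gzipped')
```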

JacquesB answered Oct 04 '22 09:10


The W3C's Character Model for the World Wide Web: Fundamentals states, "When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. US-ASCII is upwards-compatible with UTF-8 (an US-ASCII string is also a UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired." In practice, compatibility with US-ASCII is so useful that it is almost a requirement. The W3C goes on to explain, "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes."

web marketing melbourne answered Oct 04 '22 09:10