What Character Encoding is best for multinational companies

2 Answers

If you want to support a variety of languages for web content, you should use an encoding that covers the entire Unicode range. The best choice for this purpose is UTF-8. UTF-8 is the preferred encoding for the web; from the HTML5 draft standard:

Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]

Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]

UTF-8 and Windows-1252 are the only encodings required to be supported by browsers, and UTF-8 and UTF-16 are the only encodings required to be supported by XML parsers. UTF-8 is thus the only common encoding that everything is required to support.

The following is more of an expanded response to Liv's answer than an answer on its own; it's a description of why UTF-8 is preferable to UTF-16 even for CJK content.

For characters in the ASCII range, UTF-8 is more compact (1 byte vs 2) than UTF-16. For characters between the ASCII range and U+07FF (which includes Latin Extended, Cyrillic, Greek, Arabic, and Hebrew), UTF-8 also uses two bytes per character, so it's a wash. For characters outside the Basic Multilingual Plane, both UTF-8 and UTF-16 use 4 bytes per character, so it's a wash there.

The only range in which UTF-16 is more efficient than UTF-8 is for characters from U+0800 to U+FFFF, which includes Indic alphabets and CJK; there UTF-8 uses three bytes per character where UTF-16 uses two. Even for a lot of text in that range, UTF-8 winds up being comparable, because the markup of that text (HTML, XML, RTF, or what have you) is all in the ASCII range, for which UTF-8 is half the size of UTF-16.
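To make the per-range sizes concrete, here is a small Python sketch that encodes one sample character from each range and counts the bytes. The sample characters and the helper name `encoded_sizes` are my own illustrative choices, not anything from a standard.

```python
# Bytes needed per character in UTF-8 versus UTF-16, one sample per range.
# The sample characters are arbitrary illustrative picks.
SAMPLES = [
    ("ASCII (U+0000-U+007F)", "A"),       # U+0041
    ("U+0080-U+07FF", "\u0416"),          # Cyrillic capital Zhe
    ("U+0800-U+FFFF", "\u65e5"),          # a CJK ideograph
    ("beyond the BMP", "\U0001F600"),     # an emoji, U+1F600
]


def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # utf-16-be is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for label, ch in SAMPLES:
    utf8, utf16 = encoded_sizes(ch)
    print(f"{label:24} UTF-8: {utf8}  UTF-16: {utf16}")
```

Only the third row comes out in UTF-16's favour (3 bytes against 2); the boundary sits between U+07FF, which still fits in two UTF-8 bytes, and U+0800, which needs three.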

For example, if I pick a random web page in Japanese, the home page of nhk.or.jp, it is encoded in UTF-8. If I transcode it to UTF-16, it grows to almost twice its original size:

$ curl -o nhk.html 'http://www.nhk.or.jp/'
$ iconv -f UTF-8 -t UTF-16 nhk.html > nhk.16.html
$ ls -al nhk*
-rw-r--r--  1 lambda  lambda  32416 Mar 13 13:06 nhk.16.html
-rw-r--r--  1 lambda  lambda  18337 Mar 13 13:04 nhk.html
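The same effect shows up on a tiny scale. The following Python sketch compares a bare Japanese word with the same word wrapped in markup; the HTML fragment is made up for illustration and has nothing to do with the actual NHK page.

```python
# A made-up HTML fragment: a Japanese word wrapped in ASCII markup.
text_only = "ニュース"
html = '<div class="headline"><a href="/news/">' + text_only + "</a></div>"


def sizes(s):
    """Return (utf8_bytes, utf16_bytes) for a string, without a BOM."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


for label, s in (("text only", text_only), ("with markup", html)):
    utf8, utf16 = sizes(s)
    print(f"{label:12} UTF-8: {utf8:4d}  UTF-16: {utf16:4d}")
```

For the bare word UTF-16 is smaller, but once the ASCII markup is included UTF-8 comes out ahead, which is the pattern the transcoding experiment above shows for a whole page.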

UTF-8 is better in almost every way than UTF-16. Both of them are variable width encodings, and so have the complexity that entails. In UTF-16, however, 4 byte characters are fairly uncommon, so it's a lot easier to make fixed width assumptions and have everything work until you run into a corner case that you didn't catch. An example of this confusion can be seen in the encoding CESU-8, which is what you get if you convert UTF-16 text into UTF-8 by just encoding each half of a surrogate pair as a separate character (using 6 bytes per character; three bytes to encode each half of the surrogate pair in UTF-8), instead of decoding the pair to its codepoint and encoding that into UTF-8. This confusion is common enough that the mistaken encoding has actually been standardized so that at least broken programs can be made to interoperate.
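To see what the CESU-8 mistake looks like in bytes, here is a rough Python sketch. `cesu8_encode` is a hypothetical helper written for this illustration (Python has no built-in CESU-8 codec that I know of); it deliberately encodes each UTF-16 code unit on its own, which is exactly the shortcut described above.

```python
def cesu8_encode(text):
    """Encode text the CESU-8 way: every UTF-16 code unit becomes its own
    UTF-8-style sequence, so a surrogate pair takes 3 + 3 bytes."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form for a lone
        # surrogate, which the strict UTF-8 codec would refuse to do.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U0001F600"  # a character outside the Basic Multilingual Plane
print(ch.encode("utf-8").hex(" "))   # proper UTF-8: 4 bytes
print(cesu8_encode(ch).hex(" "))     # CESU-8: 6 bytes
```

For text that stays inside the BMP the two encodings produce identical bytes, which is why the bug can go unnoticed for so long. A strict UTF-8 decoder rejects the 6-byte form.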

UTF-8 is much smaller than UTF-16 for the vast majority of content, and if you're concerned about size, compressing your text will almost always do better than just picking a different encoding.

UTF-8 is compatible with APIs and data structures that use a null-terminated sequence of bytes to represent strings, so as long as your APIs and data structures either don't care about encoding or can already handle different encodings in their strings (such as most C and POSIX string handling APIs), UTF-8 can work just fine without having to have a whole new set of APIs and data structures for wide characters.

UTF-16 doesn't specify endianness, so it makes you deal with endianness issues; there are actually three related encodings: UTF-16, UTF-16BE, and UTF-16LE. Plain UTF-16 could be either big endian or little endian, and so needs a BOM to say which. UTF-16BE and UTF-16LE are the big and little endian versions, with no BOM, so you need to use an out-of-band method (such as a Content-Type HTTP header) to signal which one you're using, but out-of-band headers are notorious for being wrong or missing.
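Both the null-termination point and the endianness point are easy to check from Python. This is only a sketch using Python's built-in codecs; the sample string "hi" is arbitrary.

```python
text = "hi"

utf8 = text.encode("utf-8")
be = text.encode("utf-16-be")      # big endian, no BOM
le = text.encode("utf-16-le")      # little endian, no BOM
with_bom = text.encode("utf-16")   # BOM first, then the platform's byte order

# UTF-16 of plain ASCII text is full of zero bytes, which C-style
# null-terminated string functions would treat as end of string.
print(b"\x00" in utf8)   # False
print(b"\x00" in be)     # True

# Without a BOM or an out-of-band label, the same bytes can be read
# either way; guessing wrong silently yields different characters.
print(be.decode("utf-16-le") == text)   # False

# With a BOM, the generic "utf-16" decoder works out the byte order itself.
print((b"\xfe\xff" + be).decode("utf-16") == text)   # True
print((b"\xff\xfe" + le).decode("utf-16") == text)   # True
```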

UTF-16 is basically an accident that happened because people at first thought that 16 bits would be enough to encode all of Unicode, and so started changing their representation and APIs to use wide (16 bit) characters. When they realized they would need more characters, they came up with a scheme that sets aside a reserved range of 16-bit values (the surrogates) and uses them in pairs, so that two code units together encode one character beyond U+FFFF, and they could still use the same data structures for the new encoding. This brought all of the disadvantages of a variable-width encoding like UTF-8, without most of the advantages.
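For reference, the arithmetic behind that two-code-unit scheme is short enough to write out. This is a minimal Python sketch; `to_surrogate_pair` is a hypothetical helper name, and real code would normally just let the codec do this.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
```

Any code that indexes or slices a UTF-16 string by code unit has to know about these pairs, which is the variable-width complexity mentioned above.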

161

answered Sep 21 '22 02:09

Brian Campbell

UTF-8 is the de facto standard character encoding for Unicode.

UTF-8 is like UTF-16 and UTF-32 in that it can represent every character in the Unicode character set. But unlike UTF-16 and UTF-32, it has the advantage of being backward-compatible with ASCII. It also avoids the complications of endianness and the resulting need to use byte order marks (BOM). For these and other reasons, UTF-8 has become the dominant character encoding for the World-Wide Web, accounting for more than half of all Web pages.
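The ASCII compatibility is easy to confirm with a couple of lines of Python. This is just an illustrative check; the sample strings are arbitrary.

```python
ascii_text = "Content-Type: text/html"

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Every byte of a non-ASCII character has the high bit set, so an ASCII
# byte such as "/" or "<" never appears inside a multibyte sequence.
non_ascii = "日本語"
all_high = all(b >= 0x80 for b in non_ascii.encode("utf-8"))
print(all_high)
```

That second property is what lets ASCII-oriented tools (parsers splitting on "/" or "<", for example) handle UTF-8 text without being taught about the encoding.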

There is no such thing as UTF-128.

answered Sep 20 '22 02:09

Matt Ball

Tags:

character-encoding

utf-8

utf-16

utf-32

HGPB
