 

What's the point of UTF-16?

I've never understood the point of UTF-16 encoding. If you need to be able to treat strings as random access (i.e. a code point is the same as a code unit) then you need UTF-32, since UTF-16 is still variable length. If you don't need this, then UTF-16 seems like a colossal waste of space compared to UTF-8. What are the advantages of UTF-16 over UTF-8 and UTF-32 and why do Windows and Java use it as their native encoding?

asked Mar 13 '11 by dsimcha

People also ask

Why use UTF-16 vs UTF-8?

UTF-16 can be more compact where ASCII is not predominant, since it uses 2 bytes for every BMP character. UTF-8 needs 3 or more bytes for code points at U+0800 and above, where UTF-16 still needs only 2 bytes for any BMP character.
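To make the byte counts concrete, here is a minimal Java sketch (illustrative only; the sample characters are just examples, all within the BMP):

    import java.nio.charset.StandardCharsets;

    public class ByteCounts {
        public static void main(String[] args) {
            // Each string is a single character; compare its encoded size.
            String[] samples = { "A", "é", "あ" }; // ASCII, Latin-1, CJK
            for (String s : samples) {
                System.out.printf("U+%04X  UTF-8: %d bytes  UTF-16: %d bytes%n",
                        s.codePointAt(0),
                        s.getBytes(StandardCharsets.UTF_8).length,
                        s.getBytes(StandardCharsets.UTF_16BE).length); // BE variant avoids the BOM
            }
            // Prints 1/2, 2/2 and 3/2 bytes respectively: UTF-16 only wins
            // for characters at U+0800 and above (still within the BMP).
        }
    }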

What is the purpose of UTF encoding?

UTF-8 is currently the most popular encoding method on the internet because it can efficiently store text containing any character. UTF-16 is another encoding method, but it is usually less efficient for storing text files (except for those written mostly in scripts, such as Chinese or Japanese, whose characters need 3 bytes in UTF-8 but only 2 in UTF-16).

Why does javascript use UTF-16?

JS does require UTF-16, because the surrogate pairs of non-BMP characters are separable in JS strings. Any JS implementation using UTF-8 would have to convert to UTF-16 for proper answers to .length and array indexing on strings. That still doesn't mean it has to store the strings in UTF-16.
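Java exposes the same UTF-16 code-unit view through String, so the effect can be sketched there too (Java is used here only to keep the examples in one language; the emoji is just an example of a non-BMP character):

    public class SurrogateDemo {
        public static void main(String[] args) {
            String s = new String(Character.toChars(0x1F600)); // 😀, outside the BMP
            System.out.println(s.length());                      // 2  (UTF-16 code units)
            System.out.println(s.codePointCount(0, s.length())); // 1  (actual characters)
            System.out.println((int) s.charAt(0) == 0xD83D);     // true: first half of a surrogate pair
        }
    }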

What is the difference between UTF-16 and UTF-8?

Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts at 16 bits. Main UTF-8 pro: basic ASCII characters (digits, Latin letters with no accents, etc.) take only one byte each, identical to their US-ASCII representation.


2 Answers

When Windows NT was designed, UTF-16 didn't exist (the first version of NT shipped in 1993, while UTF-16 was born in 1996 with the Unicode 2.0 standard); there was instead UCS-2, which, at that time, was enough to hold every character available in Unicode, so the 1 code point = 1 code unit equivalence was actually true - no variable-length logic was needed for strings.

They moved to UTF-16 later to support the whole Unicode character set; however, they couldn't move to UTF-8 or UTF-32, because that would have broken binary compatibility of the existing API (among other things).

As for Java, I'm not really sure; since it was released in ~1995 I suspect that UTF-16 was already in the air (even if it wasn't standardized yet), but I think that compatibility with NT-based operating systems may have played some role in their choice (continuous UTF-8 <-> UTF-16 conversions for every call to Windows APIs can introduce some slowdown).


Edit

Wikipedia explains that Java went the same way: it originally supported UCS-2, but moved to UTF-16 in J2SE 5.0.

So, in general, when you see UTF-16 used in some API/framework, it's because it started as UCS-2 (to keep the string-management algorithms simple) and later moved to UTF-16 to support the code points outside the BMP while keeping the same code unit size.
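In code terms, the "same code unit size" point looks like this (a Java sketch; the specific code points are just examples):

    public class CodeUnits {
        public static void main(String[] args) {
            // BMP code points: one 16-bit code unit each, exactly as in UCS-2.
            System.out.println(Character.toChars(0x00E9).length);  // 1  (é)
            System.out.println(Character.toChars(0xFFFD).length);  // 1  (replacement character, still BMP)
            // A supplementary code point: two 16-bit code units (a surrogate pair).
            System.out.println(Character.toChars(0x1F600).length); // 2  (😀)
            // Strings that never leave the BMP are therefore identical in
            // UCS-2 and UTF-16, which is why the upgrade preserved compatibility.
        }
    }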

answered Sep 20 '22 by Matteo Italia


None of the replies indicating an advantage of UTF-16 over UTF-8 make any sense, except for the backwards-compatibility reply.

Well, there are two caveats to my comment.

Erik states: "UTF-16 covers the entire BMP with single units - So unless you have a need for the rarer characters outside the BMP, UTF-16 is effectively 2 bytes per character."

Caveat 1)

If you can be certain that your application will NEVER need any character outside of the BMP, and that any library code you write for use with it will NEVER be used with any application that will ever need a character outside the BMP, then you could use UTF-16, and write code that makes the implicit assumption that every character will be exactly two bytes in length.

That seems exceedingly dangerous (actually, stupid).

If your code assumes that all UTF-16 characters are two bytes in length, and your program interacts with an application or library where there is a single character outside of the BMP, then your code will break. Code that examines or manipulates UTF-16 must be written to handle the case of a UTF-16 character requiring more than 2 bytes; therefore, I am "dismissing" this caveat.

UTF-16 is not simpler to code for than UTF-8 (code for both must handle variable-length characters).
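For example, a correct per-character loop over a Java String (a sequence of UTF-16 code units) has to advance by Character.charCount rather than by 1. A minimal sketch; the helper name and sample string are made up for illustration:

    public class IterateCodePoints {
        // Visits every code point in s exactly once, inside or outside the BMP.
        static void printCodePoints(String s) {
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);     // reads 1 or 2 code units
                System.out.printf("U+%04X%n", cp);
                i += Character.charCount(cp);  // advances 1 or 2 positions
            }
        }

        public static void main(String[] args) {
            String sample = "aé" + new String(Character.toChars(0x1F600)); // last character is outside the BMP
            printCodePoints(sample);           // U+0061, U+00E9, U+1F600
        }
    }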

Caveat 2)

UTF-16 MIGHT be more computationally efficient, under some circumstances, if suitably written.

Like this: Suppose that certain long strings are seldom modified, but often examined (or better, never modified once built - i.e., a string builder creating unmodifiable strings). A flag could be set for each string, indicating whether the string contains only "fixed length" characters (i.e., contains no characters that are not exactly two bytes in length). Strings for which the flag is true could be examined with optimized code that assumes fixed length (2 byte) characters.
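A rough Java sketch of that idea (the class and method names are made up for illustration): scan once when the string is built, then take the fast path whenever the flag says the string contains no surrogates:

    final class ScannedString {
        private final String s;
        private final boolean bmpOnly;  // true => every code point is exactly one code unit

        ScannedString(String s) {
            this.s = s;
            boolean fixed = true;
            for (int i = 0; i < s.length(); i++) {
                if (Character.isSurrogate(s.charAt(i))) { fixed = false; break; }
            }
            this.bmpOnly = fixed;       // the "fixed length" flag described above
        }

        // Returns the n-th code point (not the n-th code unit).
        int codePointAt(int n) {
            if (bmpOnly) {
                return s.charAt(n);                               // O(1) fast path
            }
            return s.codePointAt(s.offsetByCodePoints(0, n));     // general path
        }
    }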

How about space-efficiency?

UTF-16 is, obviously, more efficient for A) characters for which UTF-16 requires fewer bytes to encode than does UTF-8.

UTF-8 is, obviously, more efficient for B) characters for which UTF-8 requires fewer bytes to encode than does UTF-16.

Except for very "specialized" text, it's likely that count(B) far exceeds count(A).
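A quick way to check this against your own data (a Java sketch; the sample strings are only illustrative):

    import java.nio.charset.StandardCharsets;

    public class SpaceCheck {
        static void report(String label, String s) {
            System.out.printf("%-10s UTF-8: %2d bytes   UTF-16: %2d bytes%n", label,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length);
        }

        public static void main(String[] args) {
            report("English", "The quick brown fox"); // 19 vs 38 bytes: UTF-8 wins (case B)
            report("Japanese", "速い茶色の狐");          // 18 vs 12 bytes: UTF-16 wins (case A)
        }
    }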

answered Sep 20 '22 by user3162129