Do UTF-8, UTF-16, and UTF-32 differ in the number of characters they can store?

Okay. I know this looks like the typical "Why didn't he just Google it or go to www.unicode.org and look it up?" question, but for such a simple question the answer still eludes me after checking both sources.

I am pretty sure that all three of these encoding systems support all of the Unicode characters, but I need to confirm it before I make that claim in a presentation.

Bonus question: Do these encodings differ in the number of characters they can be extended to support?

asked Sep 24 '08 by JohnFx

People also ask

Does UTF-16 have more characters than UTF-8?

No. Both can encode the same set of characters; they differ in how many bytes each character takes. UTF-16 tends to be more compact where ASCII is not predominant, since it mostly uses 2 bytes per character, whereas UTF-8 needs 3 or more bytes for higher code points where UTF-16 still uses just 2.

What is the difference between UTF-8 UTF-16 and UTF-32?

UTF-8 requires 8, 16, 24 or 32 bits (one to four bytes) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character.

How many characters can UTF-16 represent?

Supplementary characters are encoded as a pair of 16-bit values: the first in the range 0xD800 to 0xDBFF and the second in the range 0xDC00 to 0xDFFF. With supplementary characters, UTF-16 can represent more than one million characters; without them, only 65,536 characters could be represented.

How many characters can 32 bit Unicode store?

Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters.
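
A quick arithmetic check of that figure (a Python one-liner, purely illustrative):

    # 17 planes, each containing 2**16 code points
    print(17 * 2 ** 16)   # 1114112, i.e. the 1,114,112 quoted above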


2 Answers

There is no Unicode character that can be stored in one encoding but not another. This is simply because the valid Unicode characters have been restricted to what can be stored in UTF-16 (which has the smallest capacity of the three encodings). In other words, UTF-8 and UTF-32 could be used to represent a wider range of characters than UTF-16, but they aren't. Read on for more details.

UTF-8

UTF-8 is a variable-length code. Some characters require 1 byte, some require 2, some 3 and some 4. The bytes for each character are simply written one after another as a continuous stream of bytes.
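
As a quick illustration (Python is used here only as a convenient way to look at the bytes; the specific characters are arbitrary examples), characters from different ranges encode to different lengths:

    # Each character below lands in a different UTF-8 length class.
    for ch in "A", "é", "€", "🙂":
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} {ch!r} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
    # U+0041 'A' -> 1 byte(s): 41
    # U+00E9 'é' -> 2 byte(s): c3 a9
    # U+20AC '€' -> 3 byte(s): e2 82 ac
    # U+1F642 '🙂' -> 4 byte(s): f0 9f 99 82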

While some UTF-8 characters can be 4 bytes long, UTF-8 cannot encode 2^32 characters. It's not even close. I'll try to explain the reasons for this.

The software that reads a UTF-8 stream just gets a sequence of bytes: how is it supposed to decide whether the next 4 bytes are a single 4-byte character, two 2-byte characters, or four 1-byte characters (or some other combination)? Basically, this is done by deciding that certain 1-byte sequences aren't valid characters, certain 2-byte sequences aren't valid characters, and so on. When these invalid sequences appear, it is assumed that they form part of a longer sequence.
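
A minimal sketch of that decision (my own illustration, not a complete or validating decoder): the lead byte's high bits announce how many bytes the character occupies.

    def utf8_sequence_length(lead_byte: int) -> int:
        """How many bytes does the UTF-8 character starting with lead_byte occupy?"""
        if lead_byte < 0x80:            # 0xxxxxxx: a 1-byte (ASCII) character
            return 1
        if 0xC0 <= lead_byte < 0xE0:    # 110xxxxx: starts a 2-byte character
            return 2
        if 0xE0 <= lead_byte < 0xF0:    # 1110xxxx: starts a 3-byte character
            return 3
        if 0xF0 <= lead_byte < 0xF8:    # 11110xxx: starts a 4-byte character
            return 4
        raise ValueError("not a valid lead byte (e.g. a 10xxxxxx continuation byte)")

    data = "A€🙂".encode("utf-8")
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        print(data[i:i + n].decode("utf-8"), "occupies", n, "byte(s)")
        i += n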

You've seen a rather different example of this, I'm sure: it's called escaping. In many programming languages it is decided that the \ character in a string's source code doesn't translate to any valid character in the string's "compiled" form. When a \ is found in the source, it is assumed to be part of a longer sequence, like \n or \xFF. Note that \x is an invalid 2-character sequence, and \xF is an invalid 3-character sequence, but \xFF is a valid 4-character sequence.
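
The same idea can be demonstrated with Python's escape-sequence decoder (this is only the analogy, not UTF-8 itself): a complete escape decodes to one character, while a truncated one is rejected rather than guessed at.

    # "\x41" is a complete 4-character escape sequence: it decodes to one character.
    print(rb"\x41".decode("unicode_escape"))    # -> A

    # "\x4" is an incomplete escape, so the decoder refuses to guess.
    try:
        rb"\x4".decode("unicode_escape")
    except UnicodeDecodeError as err:
        print("rejected:", err)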

Basically, there's a trade-off between having many characters and having shorter characters. If you want 2^32 characters, they need to be on average 4 bytes long. If you want all your characters to be 2 bytes or less, then you can't have more than 2^16 characters. UTF-8 gives a reasonable compromise: all ASCII characters (ASCII 0 to 127) are given 1-byte representations, which is great for compatibility, but many more characters are allowed.

Like most variable-length encodings, including the kinds of escape sequences shown above, UTF-8 is an instantaneous code. This means that the decoder just reads byte by byte, and as soon as it reaches the last byte of a character, it knows what the character is (and it knows that it isn't the beginning of a longer character).

For instance, the character 'A' is represented using the byte 65, and there are no two/three/four-byte characters whose first byte is 65. Otherwise the decoder wouldn't be able to tell those characters apart from an 'A' followed by something else.

But UTF-8 is restricted even further. It ensures that the encoding of a shorter character never appears anywhere within the encoding of a longer character. For instance, none of the bytes in a 4-byte character can be 65.

Since UTF-8 has 128 different 1-byte characters (whose byte values are 0-127), all 2, 3 and 4-byte characters must be composed solely of bytes in the range 128-255. That's a big restriction. However, it allows byte-oriented string functions to work with little or no modification. For instance, C's strstr() function always works as expected if its inputs are valid UTF-8 strings.
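
To make that concrete, here is a small check (Python's bytes.find() standing in for C's strstr(); the strings are arbitrary examples): a byte-level search cannot produce a false match, because ASCII bytes never occur inside multi-byte characters.

    haystack = "naïve café 🙂".encode("utf-8")
    needle = "café".encode("utf-8")

    # Byte-oriented search, like strstr(), finds the real occurrence.
    pos = haystack.find(needle)
    print(pos, haystack[pos:pos + len(needle)].decode("utf-8"))   # -> 7 café

    # Every byte of a multi-byte character is in the range 128-255,
    # so it can never be mistaken for an ASCII character such as 'A'.
    for ch in "é🙂":
        assert all(b >= 0x80 for b in ch.encode("utf-8"))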

UTF-16

UTF-16 is also a variable-length code; its characters consume either 2 or 4 bytes. 16-bit values in the range 0xD800-0xDFFF are reserved for constructing 4-byte characters, and every 4-byte character consists of a 16-bit value in the range 0xD800-0xDBFF followed by a 16-bit value in the range 0xDC00-0xDFFF. For this reason, Unicode does not assign any characters to the range U+D800-U+DFFF.
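
As an illustration (a sketch of the standard surrogate-pair arithmetic, with an emoji chosen arbitrarily), a code point above U+FFFF splits into one value from each reserved range, and the result matches what Python's UTF-16 encoder produces:

    def to_surrogate_pair(code_point: int) -> tuple:
        """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
        assert 0x10000 <= code_point <= 0x10FFFF
        offset = code_point - 0x10000           # a 20-bit value
        high = 0xD800 + (offset >> 10)          # top 10 bits    -> 0xD800-0xDBFF
        low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits -> 0xDC00-0xDFFF
        return high, low

    high, low = to_surrogate_pair(ord("😀"))    # U+1F600
    print(hex(high), hex(low))                  # 0xd83d 0xde00
    print("😀".encode("utf-16-be").hex(" "))    # d8 3d de 00, the same pair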

UTF-32

UTF-32 is a fixed-length code, with each character being 4 bytes long. While this allows the encoding of 2^32 different characters, only values between 0 and 0x10FFFF are allowed in this scheme.
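
A short check of that fixed width (Python's utf-32-be codec is used so that no byte-order mark is prepended; the sample string is arbitrary):

    s = "A€🙂"
    encoded = s.encode("utf-32-be")
    print(len(s), "characters ->", len(encoded), "bytes")   # 3 characters -> 12 bytes
    assert len(encoded) == 4 * len(s)
    print(encoded.hex(" "))   # 00 00 00 41 00 00 20 ac 00 01 f6 42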

Capacity comparison:

  • UTF-8: 2,097,152 (actually 2,166,912 but due to design details some of them map to the same thing)
  • UTF-16: 1,112,064
  • UTF-32: 4,294,967,296 (but restricted to the first 1,114,112)

The most restricted is therefore UTF-16! The formal Unicode definition has limited the Unicode characters to those that can be encoded with UTF-16 (i.e. the range U+0000 to U+10FFFF excluding U+D800 to U+DFFF). UTF-8 and UTF-32 support all of these characters.
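
That UTF-16 figure can be reproduced with a little arithmetic (just an illustration of the numbers above):

    code_points = 0x110000            # U+0000..U+10FFFF inclusive: 1,114,112 values
    surrogates = 0xE000 - 0xD800      # U+D800..U+DFFF: 2,048 reserved values
    print(code_points - surrogates)   # 1112064, the UTF-16 capacity listed above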

The UTF-8 system is in fact "artificially" limited to 4 bytes. It can be extended to 8 bytes without violating the restrictions I outlined earlier, and this would yield a capacity of 2^42. The original UTF-8 specification in fact allowed up to 6 bytes, which gives a capacity of 2^31. But RFC 3629 limited it to 4 bytes, since that is how much is needed to cover all of what UTF-16 does.

There are other (mainly historical) Unicode encoding schemes, notably UCS-2 (which is only capable of encoding U+0000 to U+FFFF).

answered by Artelius


No, they're simply different encoding methods. They all support encoding the same set of characters.

UTF-8 uses anywhere from one to four bytes per character depending on what character you're encoding. Characters within the ASCII range take only one byte while very unusual characters take four.

UTF-32 uses four bytes per character regardless of what character it is, so it will always use more space than UTF-8 to encode the same string. The only advantage is that you can calculate the number of characters in a UTF-32 string by only counting bytes.

UTF-16 uses two bytes for most characters and four bytes for the less common ones.
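
For a feel of the size trade-off, a quick comparison (Python; the -be codec variants are used so that no byte-order mark inflates the counts, and the sample string is arbitrary):

    text = "hello, 世界 🙂"
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        print(codec, "->", len(text.encode(codec)), "bytes for", len(text), "characters")

    # With UTF-32 the character count is simply the byte count divided by four.
    assert len(text.encode("utf-32-be")) // 4 == len(text)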

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

answered by skoob