 

Why is there no UTF-24? [duplicate]

Possible Duplicate:
Why UTF-32 exists whereas only 21 bits are necessary to encode every character?

The maximum Unicode code point is 0x10FFFF, which fits in 21 bits, so each UTF-32 code unit carries 21 bits of information and 11 permanently-zero bits. Why is there no UTF-24 encoding (i.e. UTF-32 with the always-zero high byte removed) that stores each code point in 3 bytes rather than 4?
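To make the question concrete, here is a minimal sketch of what such a fixed-width 3-byte encoding could look like, assuming big-endian byte order; the utf24_encode/utf24_decode names are purely illustrative and not part of any standard:

```c
/* Hypothetical "UTF-24": each code point is simply the low three bytes of
 * its UTF-32 value. Illustrative sketch only, not a standardized encoding. */
#include <stdint.h>
#include <stdio.h>

/* Write one code point (U+0000..U+10FFFF) as 3 bytes, most significant first. */
static void utf24_encode(uint32_t cp, uint8_t out[3]) {
    out[0] = (cp >> 16) & 0xFF;  /* at most 0x10, so a 4th byte is never needed */
    out[1] = (cp >> 8) & 0xFF;
    out[2] = cp & 0xFF;
}

/* Read one code point back from 3 bytes. */
static uint32_t utf24_decode(const uint8_t in[3]) {
    return ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
}

int main(void) {
    uint8_t buf[3];
    utf24_encode(0x10FFFF, buf);  /* highest Unicode code point */
    printf("%02X %02X %02X -> U+%06X\n",
           buf[0], buf[1], buf[2], utf24_decode(buf));
    return 0;
}
```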

asked Apr 13 '12 by Anthony Faull


1 Answer

Well, the truth is: UTF-24 was actually suggested, back in 2007:

http://unicode.org/mail-arch/unicode-ml/y2007-m01/0057.html

The pros and cons mentioned in that proposal were:

"UTF-24  Advantages:   1. Fixed length code units.   2. Encoding format is easily detectable for any content, even if mislabeled.   3. Byte order can be reliably detected without the use of BOM, even for single-code-unit data.   4. If octets are dropped / inserted, decoder can resync at next valid code unit.   5. Practical for both internal processing and storage / interchange.   6. Conversion to code point scalar values is more trivial then for UTF-16 surrogate pairs      and UTF-7/8 multibyte sequences.   7. 7-bit transparent version can be easily derived.   8. Most compact for texts in archaic scripts.  Disadvantages:   1. Takes more space then UTF-8/16, except for texts in archaic scripts.   2. Comparing to UTF-32, extra bitwise operations required to convert to code point scalar values.   3. Incompatible with many legacy text-processing tools and protocols. " 

As pointed out by David Starner in http://www.mail-archive.com/unicode@unicode.org/msg16011.html:

Why? UTF-24 will almost invariably be larger than UTF-16, unless you are talking about a document in Old Italic or Gothic. The math alphanumeric characters will almost always be combined with enough ASCII to make UTF-8 a win, and if not, enough BMP characters to make UTF-16 a win. Modern computers don't deal with 24-bit chunks well; in memory, they'd take up 32 bits a piece, unless you declared them packed, and then they'd be a lot slower than UTF-16 or UTF-32. And if you're storing to disk, you may as well use BOCU or SCSU (you're already going non-standard), or use standard compression with UTF-8, UTF-16, BOCU or SCSU. SCSU or BOCU compressed should take up half the space of UTF-24, if that.
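Starner's memory point can be checked with a few sizeof comparisons: an array of packed 3-byte units does save a byte per character over UTF-32, but each unit then has to be reassembled from individual bytes before it can be used as a code point. A sketch, with an illustrative struct name:

```c
/* Packed 3-byte units versus plain 32-bit code points. */
#include <stdint.h>
#include <stdio.h>

struct cp24 { uint8_t b[3]; };  /* byte array: alignment 1, so sizeof is 3 */

int main(void) {
    printf("sizeof(uint32_t) = %zu, sizeof(struct cp24) = %zu\n",
           sizeof(uint32_t), sizeof(struct cp24));
    /* 1000 characters: 4000 bytes as UTF-32, 3000 bytes as packed 24-bit units,
       but each 24-bit unit must be rebuilt from bytes before it can be compared
       or looked up, unlike a directly usable uint32_t. */
    printf("1000 chars: %zu bytes (UTF-32) vs %zu bytes (packed 24-bit)\n",
           1000 * sizeof(uint32_t), 1000 * sizeof(struct cp24));
    return 0;
}
```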

You could also check the following Stack Overflow post:

Why UTF-32 exists whereas only 21 bits are necessary to encode every character?

answered Sep 19 '22 by Skippy Fastol