 

Why does UTF-32 exist when only 21 bits are necessary to encode every character?

We know that code points lie in the interval 0..10FFFF, which is less than 2^21. So why do we need UTF-32 when every code point can be represented in 3 bytes? A UTF-24 should be enough.
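For reference, a minimal C sketch (not part of the original question, just the arithmetic) confirming that the highest code point, U+10FFFF, fits in 21 bits and hence in 3 bytes:

    #include <stdio.h>

    int main(void) {
        unsigned long max_cp = 0x10FFFF;  /* highest Unicode code point */
        printf("fits in 21 bits: %d\n", max_cp < (1UL << 21));  /* prints 1 */
        printf("fits in 3 bytes: %d\n", max_cp < (1UL << 24));  /* prints 1 */
        return 0;
    }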

Sergey asked Jun 14 '11

People also ask

How many characters can UTF-32 represent?

UTF-8/16/32 are simply different ways to encode the same set of Unicode code points. In brief, UTF-32 uses a 32-bit value for each character, which makes it a fixed-width encoding. UTF-16 uses 16-bit code units by default, but a single 16-bit unit only gives you about 65k possible characters, which is nowhere near enough for the full Unicode set.

Why does a character in UTF-32 take more space than in UTF-16 or UTF-8?

In UTF-8, characters within the ASCII range take only one byte, while rare characters take four. UTF-32 uses four bytes per character regardless of which character it is, so it will never use less space than UTF-8 to encode the same string, and will usually use more.

How are UTF-8 and UTF-32 different?

UTF-8 requires 8, 16, 24 or 32 bits (one to four bytes) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character.

How many bytes does UTF-16 need to represent a character?

UTF-16 is based on 16-bit code units, so each character is encoded as at least 2 bytes. Some characters that are encoded with a 1-byte code unit in UTF-8 are encoded with a 2-byte code unit in UTF-16. Supplementary characters, which are encoded with surrogate pairs, use 4 bytes and thus require additional storage.
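To make these size differences concrete, here is a small C11 sketch (the sample string and variable names are arbitrary choices) that prints how many bytes the same text occupies in each encoding:

    #include <stdio.h>
    #include <uchar.h>   /* char16_t, char32_t (C11) */

    int main(void) {
        /* "A", "é", "中", "😀": 1-, 2-, 3- and 4-byte characters in UTF-8 */
        static const char     utf8[]  = u8"A\u00E9\u4E2D\U0001F600";
        static const char16_t utf16[] =  u"A\u00E9\u4E2D\U0001F600";
        static const char32_t utf32[] =  U"A\u00E9\u4E2D\U0001F600";

        /* subtract the terminating null from each count */
        printf("UTF-8 : %zu bytes\n", sizeof utf8  - sizeof utf8[0]);   /* 10 */
        printf("UTF-16: %zu bytes\n", sizeof utf16 - sizeof utf16[0]);  /* 10 */
        printf("UTF-32: %zu bytes\n", sizeof utf32 - sizeof utf32[0]);  /* 16 */
        return 0;
    }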


4 Answers

Computers are generally much better at dealing with data on 4-byte boundaries. The benefit in terms of reduced memory consumption is relatively small compared with the pain of working on 3-byte boundaries.

(I speculate there was also a reluctance to have a limit that was "only what we can currently imagine being useful" when coming up with the original design. After all, that's caused a lot of problems in the past, e.g. with IPv4. While I can't see us ever needing more than 24 bits, if 32 bits is more convenient anyway then it seems reasonable to avoid having a limit which might just be hit one day, via reserved ranges etc.)

I guess this is a bit like asking why we often have 8-bit, 16-bit, 32-bit and 64-bit integer datatypes (byte, int, long, whatever) but not 24-bit ones. I'm sure there are lots of occasions where we know that a number will never go beyond 2^21, but it's just simpler to use int than to create a 24-bit type.
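As an aside, a small C sketch of that last point (bit-field layout is implementation-defined, but the result shown is typical of mainstream compilers): even if you declare a 24-bit field, its storage usually gets rounded up to 4 bytes anyway.

    #include <stdio.h>
    #include <stdint.h>

    /* a hypothetical 24-bit "code point" type built from a bit-field */
    struct cp24 {
        uint32_t value : 24;
    };

    int main(void) {
        /* Typically prints 4, not 3: the struct is padded to the size and
           alignment of its underlying 32-bit storage unit. */
        printf("sizeof(struct cp24) = %zu\n", sizeof(struct cp24));
        printf("sizeof(uint32_t)    = %zu\n", sizeof(uint32_t));
        return 0;
    }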

Jon Skeet answered Oct 17 '22


First there were two character encoding schemes: UCS-4, which coded each character in 32 bits as an unsigned integer in the range 0x00000000 - 0x7FFFFFFF, and UCS-2, which used 16 bits for each code point.

Later it turned out that using just the 65536 code points of UCS-2 would get one into problems anyway, but many programs (Windows, cough) relied on wide characters being 16 bits wide, so UTF-16 was created. UTF-16 encodes the code points in the range U+0000 - U+FFFF just like UCS-2, and the range U+10000 - U+10FFFF using surrogate pairs, i.e. pairs of 16-bit values.

As this was a bit complicated, UTF-32 was introduced as a simple one-to-one mapping for characters beyond U+FFFF. Since UTF-16 can only encode up to U+10FFFF, it was decided that this will be the maximum value ever assigned, so that there will be no further compatibility problems; UTF-32 therefore effectively needs just 21 bits. As an added bonus, UTF-8, which was initially planned to be a 1-to-6-byte encoding, now never needs more than 4 bytes per code point. Therefore it can easily be proven that UTF-8 never requires more storage than UTF-32.
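For concreteness, here is how a code point above U+FFFF is split into a UTF-16 surrogate pair (a sketch of the standard formula; the function name is just an illustration):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair. */
    static void cp_to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low) {
        assert(cp >= 0x10000 && cp <= 0x10FFFF);
        cp -= 0x10000;                   /* now a 20-bit value              */
        *high = 0xD800 + (cp >> 10);     /* top 10 bits -> high surrogate   */
        *low  = 0xDC00 + (cp & 0x3FF);   /* low 10 bits -> low surrogate    */
    }

    int main(void) {
        uint16_t hi, lo;
        cp_to_surrogates(0x1F600, &hi, &lo);  /* U+1F600, the grinning-face emoji */
        printf("U+1F600 -> 0x%04X 0x%04X\n", (unsigned)hi, (unsigned)lo);  /* 0xD83D 0xDE00 */
        return 0;
    }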

It is true that a hypothetical UTF-24 format would save memory compared to UTF-32. However, its savings would be dubious in practice, as it would mostly consume more storage than UTF-8, except for sheer blasts of emoji or the like - and not many interesting texts of significant length consist solely of emoji.

But UTF-32 is used as an in-memory representation for text in programs that need simply-indexed access to code points - it is the only encoding where the Nth element of a C array is also the Nth code point. UTF-24 would offer the same property with 25% memory savings but more complicated element access.
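A sketch of that difference in C (the 3-byte layout below is a made-up little-endian packing, since no standard UTF-24 exists):

    #include <inttypes.h>
    #include <stddef.h>
    #include <stdio.h>

    /* UTF-32: the Nth code point is simply the Nth array element. */
    static uint32_t nth_utf32(const uint32_t *text, size_t n) {
        return text[n];
    }

    /* Hypothetical UTF-24 (3 bytes per code point, little-endian here):
       the Nth code point must be reassembled from bytes 3n, 3n+1, 3n+2. */
    static uint32_t nth_utf24(const uint8_t *text, size_t n) {
        const uint8_t *p = text + 3 * n;
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    }

    int main(void) {
        const uint32_t s32[] = { 0x48, 0x1F600, 0x10FFFF };
        const uint8_t  s24[] = { 0x48,0x00,0x00,  0x00,0xF6,0x01,  0xFF,0xFF,0x10 };
        /* both print 1F600 */
        printf("%" PRIX32 " %" PRIX32 "\n", nth_utf32(s32, 1), nth_utf24(s24, 1));
        return 0;
    }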


It's true that only 21 bits are required (reference), but modern computers are good at moving 32-bit units of things around and generally interacting with them. I don't think I've ever used a programming language that had a 24-bit integer or character type, nor a platform where that was a multiple of the processor's word size (not since I last used an 8-bit computer; UTF-24 would be reasonable on an 8-bit machine), though naturally there have been some.

T.J. Crowder answered Oct 17 '22


UTF-32 uses 32-bit code units, a multiple of 16 bits. Working with 32-bit quantities is much more common than working with 24-bit quantities and is usually better supported. It also keeps each character 4-byte aligned (assuming the entire string is 4-byte aligned). Going from 1 byte to 2 bytes to 4 bytes is the most "logical" progression.
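A quick way to see the alignment point (the numbers are platform-dependent, but these are typical on common desktop targets):

    #include <stdio.h>
    #include <stdalign.h>   /* alignof (C11) */
    #include <uchar.h>      /* char32_t */
    #include <stdint.h>

    int main(void) {
        /* char32_t is normally 4-byte aligned, so in a UTF-32 string the
           Nth character starts on a 4-byte boundary. */
        printf("alignof(char32_t)   = %zu\n", (size_t)alignof(char32_t));

        /* A packed 3-byte element only has byte alignment. */
        typedef uint8_t cp24[3];
        printf("alignof(uint8_t[3]) = %zu\n", (size_t)alignof(cp24));
        return 0;
    }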

Apart from that: the Unicode standard is ever-growing. Code points outside the current range (beyond U+10FFFF) could eventually be assigned (though it is somewhat unlikely in the near future, given the huge number of unassigned code points still available).

Joachim Sauer answered Oct 17 '22