What is the difference between UTF-32 and UCS-4 ? Isn't UTF-32 supposed to be a fixed-width encoding ?
UTF-32 (or UCS-4) is a protocol for encoding Unicode characters that uses exactly 32 bits for each Unicode code point. All other Unicode transformation formats use variable-length encodings. The UTF-32 form of a character is a direct representation of its codepoint.
UTF-32 allows characters to be encoded as 4 bytes at any code point from 00000000 to 0010FFFF. For example, the string ABC in UTF-32 is encoded as x"000000410000004200000043" .
The main disadvantage of UTF-32 is that it is space-inefficient, using four bytes per code point, including 11 bits that are always zero. Characters beyond the BMP are relatively rare in most texts (except for e.g. texts with some popular emojis), and can typically be ignored for sizing estimates.
UTF-16 is only more efficient than UTF-8 on some non-English websites. If a website uses a language with characters farther back in the Unicode library, UTF-8 will encode all characters as four bytes, whereas UTF-16 might encode many of the same characters as only two bytes.
The Unicode Standard Version 8.0, Appendix C states:
UCS-4 stands for “Universal Character Set coded in 4 octets.” It is now treated simply as a synonym for UTF-32, and is considered the canonical form for representation of characters in ISO 10646 (Universal Coded Character Set).
UTF-32
has started as a subset of UCS-4
. Now it is identical except that the UTF-32 standard has additional Unicode semantics. See details on wikipedia:
The original ISO 10646 standard defines a 31-bit encoding form called UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a 32-bit friendly code value in the code space of integers between 0 and hexadecimal 7FFFFFFF.
Because only 17 planes are actually in use, all current code points are between 0 and 0x10FFFF. UTF-32 is a subset of UCS-4 that uses only this range. Since the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes, UTF-32 will be able to represent all Unicode characters. Accordingly, UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.
However, I am not exactly sure, what additional Unicode semantics
means. Maybe someone can provide a better answer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With