
What is the difference between UTF-32 and UCS-4?

What is the difference between UTF-32 and UCS-4? Isn't UTF-32 supposed to be a fixed-width encoding?

Virus721 asked May 12 '15 09:05



People also ask

What is UCS 4 encoding?

UTF-32 (or UCS-4) is an encoding form for Unicode that uses exactly 32 bits for each code point. The other Unicode encoding forms, UTF-8 and UTF-16, are variable-length. The UTF-32 form of a character is a direct representation of its code point.

What is UTF-32 used for?

UTF-32 encodes every code point from 00000000 to 0010FFFF as exactly 4 bytes. For example, the string ABC in big-endian UTF-32 is encoded as x"000000410000004200000043".
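The ABC example can be checked with a short sketch. It assumes Python 3 and its built-in "utf-32-be" codec (big-endian, no byte order mark); the variable names are illustrative only.

```python
# Encode "ABC" as big-endian UTF-32 without a byte order mark.
encoded = "ABC".encode("utf-32-be")

# Each code point occupies exactly 4 bytes, so 3 characters give 12 bytes.
hex_form = encoded.hex()
print(hex_form)      # 000000410000004200000043
print(len(encoded))  # 12

# A code point outside the BMP (U+1F600) still takes exactly 4 bytes.
emoji_bytes = "\U0001F600".encode("utf-32-be")
print(emoji_bytes.hex(), len(emoji_bytes))  # 0001f600 4
```

The plain "utf-32" codec would prepend a 4-byte byte order mark, which is why the explicit big-endian variant is used here.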

Why is UTF-32 rarely used?

The main disadvantage of UTF-32 is that it is space-inefficient: it uses four bytes per code point, including 11 bits that are always zero. Characters beyond the BMP are relatively rare in most texts (except, for example, texts with popular emojis) and can typically be ignored for sizing estimates, so UTF-32 text is usually about twice the size of the same text in UTF-16, and up to four times the size of UTF-8 for mostly-ASCII text.

Is UTF-8 better than UTF-16?

UTF-16 is more compact than UTF-8 only for text dominated by code points in the range U+0800 to U+FFFF, such as most Chinese, Japanese, and Korean text. UTF-8 encodes those characters as three bytes each, whereas UTF-16 encodes them as two. ASCII characters take one byte in UTF-8 and two in UTF-16, and code points above U+FFFF take four bytes in both.
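The size trade-offs between the three encoding forms can be compared directly. This is a minimal sketch assuming Python 3; the helper name encoded_sizes and the sample strings are made up for illustration, and the little-endian codecs are used only to avoid counting a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

ascii_sizes = encoded_sizes("hello")       # 5 ASCII characters
cjk_sizes = encoded_sizes("日本語")         # 3 BMP characters above U+0800
emoji_sizes = encoded_sizes("\U0001F600")  # 1 character outside the BMP

print(ascii_sizes)  # {'utf-8': 5, 'utf-16': 10, 'utf-32': 20}
print(cjk_sizes)    # {'utf-8': 9, 'utf-16': 6, 'utf-32': 12}
print(emoji_sizes)  # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

UTF-32 is the largest in every case except the emoji, where all three forms need four bytes.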


2 Answers

The Unicode Standard Version 8.0, Appendix C states:

UCS-4 stands for “Universal Character Set coded in 4 octets.” It is now treated simply as a synonym for UTF-32, and is considered the canonical form for representation of characters in ISO 10646 (Universal Coded Character Set).

Jonathan Maddox answered Oct 03 '22 06:10



UTF-32 started as a subset of UCS-4. The two are now identical, except that the UTF-32 standard has additional Unicode semantics. See the details on Wikipedia:

The original ISO 10646 standard defines a 31-bit encoding form called UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a 32-bit friendly code value in the code space of integers between 0 and hexadecimal 7FFFFFFF.

Because only 17 planes are actually in use, all current code points are between 0 and 0x10FFFF. UTF-32 is a subset of UCS-4 that uses only this range. Since the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes, UTF-32 will be able to represent all Unicode characters. Accordingly, UCS-4 and UTF-32 are now identical except that the UTF-32 standard has additional Unicode semantics.
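The two ranges described in the quote can be written down as a small sketch. It assumes Python 3; the constant and function names are hypothetical, chosen only for this illustration. The surrogate exclusion in is_valid_utf32 reflects the Unicode rule that U+D800 to U+DFFF are not valid in UTF-32, which the quote does not mention.

```python
UCS4_MAX = 0x7FFFFFFF   # upper bound of the original 31-bit UCS-4 code space
UTF32_MAX = 0x10FFFF    # upper bound of the 17 planes used by Unicode

def in_original_ucs4(value: int) -> bool:
    """True if value fits the original 31-bit UCS-4 code space."""
    return 0 <= value <= UCS4_MAX

def is_valid_utf32(value: int) -> bool:
    """True if value is in the UTF-32 range and is not a surrogate code point."""
    return 0 <= value <= UTF32_MAX and not (0xD800 <= value <= 0xDFFF)

def decodes_as_utf32(value: int) -> bool:
    """Ask Python's own utf-32-be decoder whether it accepts this 32-bit value."""
    try:
        value.to_bytes(4, "big").decode("utf-32-be")
        return True
    except UnicodeDecodeError:
        return False

for value in (0x41, 0x10FFFF, 0x110000):
    print(hex(value), in_original_ucs4(value), is_valid_utf32(value),
          decodes_as_utf32(value))
```

A value such as 0x110000 fits the original UCS-4 range but lies outside UTF-32, and Python's decoder rejects it.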

However, I am not exactly sure what "additional Unicode semantics" means. Maybe someone can provide a better answer.

Christian Gollhardt answered Oct 03 '22 05:10
