
Why UTF-32 instead of UTF-16 if we have surrogate pairs?

If I understand correctly, UTF-32 can handle every character in the universe. So can UTF-16, through the use of surrogate pairs. So is there any good reason to use UTF-32 instead of UTF-16?

zildjohn01 asked Mar 09 '09 04:03



People also ask

Why is UTF-32 rarely used?

The main disadvantage of UTF-32 is that it is space-inefficient, using four bytes per code point, including 11 bits that are always zero. Characters beyond the BMP are relatively rare in most texts (except for e.g. texts with some popular emojis), and can typically be ignored for sizing estimates.

Why does a character in UTF-32 take more space?

UTF-32 uses four bytes per character regardless of what character it is, so it never uses less space than UTF-8 for the same string, and it uses more whenever the string contains any character that UTF-8 encodes in one to three bytes.

What is the difference between UTF-8 and UTF-16 and UTF-32?

UTF-8 requires 8, 16, 24 or 32 bits (one to four bytes) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character.
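The size difference is easy to check directly. The following is a minimal sketch using Python's built-in codecs; the helper name `encoded_sizes` is made up for illustration, and the little-endian codec variants are chosen only so that no byte order mark is counted.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Return how many bytes one character occupies in each encoding."""
    # The "-le" codecs are used so that no byte order mark is prepended.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# One sample from each UTF-8 length class: U+0041, U+00E9, U+20AC, U+1F600.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only the last sample lies beyond the BMP, and it is the only one for which UTF-16 needs 4 bytes; UTF-32 needs 4 bytes for all of them.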

What is surrogate encoding?

The surrogate code points are used in UTF-16 to represent code points beyond U+FFFF. They are used in pairs of two 16-bit code units, so such a character takes 4 bytes. This mechanism is not needed in UTF-8, so text encoded with UTF-8 shouldn't contain them.
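As a rough sketch of how a surrogate pair is derived (the function name `to_surrogate_pair` is hypothetical; the arithmetic is the standard UTF-16 rule): subtract 0x10000 from the code point, put the top 10 bits of the result into the high surrogate and the bottom 10 bits into the low surrogate.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # fits in 20 bits
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

The result can be cross-checked against Python's own encoder: `"\U0001F600".encode("utf-16-be")` yields the bytes D8 3D DE 00.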


1 Answer

In UTF-32, a Unicode code point is always represented by 4 bytes, so parsing code is easier to write than for a UTF-16 string, where a code point takes either 2 or 4 bytes depending on whether it needs a surrogate pair. On the downside, a UTF-32 character always requires 4 bytes, which can be wasteful if you are working mostly with, say, English text. So it's a design choice, depending on your requirements, whether to use UTF-16 or UTF-32.
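To make the "easier parsing" point concrete, here is a minimal sketch of fetching the n-th code point from raw bytes in each encoding. Both function names are made up for illustration, and the code assumes well-formed little-endian input without a byte order mark. With UTF-32 the position is simple arithmetic; with UTF-16 the code has to walk the string and check for surrogates along the way.

```python
import struct


def nth_code_point_utf32(data: bytes, n: int) -> int:
    """Constant time: the n-th code point starts at byte offset 4 * n."""
    (code_point,) = struct.unpack_from("<I", data, 4 * n)
    return code_point


def nth_code_point_utf16(data: bytes, n: int) -> int:
    """Linear time: every earlier unit must be inspected for surrogates."""
    pos = 0
    while True:
        (unit,) = struct.unpack_from("<H", data, pos)
        is_high_surrogate = 0xD800 <= unit <= 0xDBFF
        if n == 0:
            if is_high_surrogate:
                # Combine the pair back into a single code point.
                (low,) = struct.unpack_from("<H", data, pos + 2)
                return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            return unit
        # Skip 4 bytes for a surrogate pair, 2 bytes otherwise.
        pos += 4 if is_high_surrogate else 2
        n -= 1


text = "a\U0001F600b"
utf32 = text.encode("utf-32-le")
utf16 = text.encode("utf-16-le")
print(hex(nth_code_point_utf32(utf32, 2)))  # 0x62
print(hex(nth_code_point_utf16(utf16, 2)))  # 0x62
```

The trade-off is visible in the buffer sizes as well: for this three-character string, the UTF-32 buffer is 12 bytes and the UTF-16 buffer is 8.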

Raminder answered Dec 07 '22 19:12
