If I understand correctly, UTF-32 can handle every character in the universe. So can UTF-16, through the use of surrogate pairs. So is there any good reason to use UTF-32 instead of UTF-16?
The main disadvantage of UTF-32 is that it is space-inefficient, using four bytes per code point, including 11 bits that are always zero. Characters beyond the BMP are relatively rare in most texts (except for e.g. texts with some popular emojis), and can typically be ignored for sizing estimates.
UTF-32 uses four bytes per character regardless of what character it is, so it will always use more space than UTF-8 to encode the same string.
UTF-8 requires 8, 16, 24 or 32 bits (one to four bytes) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character.
The surrogate code points are used in UTF-16 to represent code points beyond FFFF . They are used in pairs, so a character is made of 4 bytes. This mechanism is not needed in UTF-8, so text encoded with UTF-8 shouldn't contain them.
In UTF-32 a unicode character would always be represented by 4 bytes so parsing code would be easier to write than that of a UTF-16 string because in UTF-16 a character is represented by varying number of bytes. On the downside a UTF-32 chatacter would always require 4 bytes which can be wasteful if you are working mostly with say english characters. So its a design choice depending upon your requirements whether to use UTF-16 or UTF-32.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With