I'm working on internationalizing one of my programs for work. I'm trying to use foresight to avoid possible issues or redoing the process down the road.
I see references for UTF-8, UTF-16 and UTF-32. My question is two parts:
If UTF-8 works for everything, then I'm curious what the advantages of UTF-16 and UTF-32 are (e.g. special search features in a database, etc.). Having that understanding should help me finish designing my program (and database connections) properly. Thanks!
UTF-8 supports any Unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phoenician, Cherokee, etc.), as well as many non-spoken languages (music notation, mathematical symbols, APL). The stated objective of the Unicode Consortium is to encompass all communications.
0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF are invalid UTF-8 code units. A UTF-8 code unit is 8 bits. If by char you mean an 8-bit byte, then the invalid UTF-8 code units would be char values that do not appear in UTF-8 encoded text.
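As a rough illustration of the point about invalid code units, here is a minimal Python sketch. The names NEVER_VALID, find_impossible_bytes and is_valid_utf8 are made up for this example. It scans a byte string for the 13 values listed above, and separately runs a full validity check with Python's built-in strict UTF-8 decoder.

```python
# The 13 byte values that never occur in well-formed UTF-8.
NEVER_VALID = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))


def find_impossible_bytes(data):
    """Return the offsets of bytes that no valid UTF-8 text can contain."""
    return [i for i, b in enumerate(data) if b in NEVER_VALID]


def is_valid_utf8(data):
    """Full check: let Python's strict decoder accept or reject the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(find_impossible_bytes(b"abc\xff"))    # offset of the 0xFF byte
    print(is_valid_utf8("\u00e9".encode("utf-8")))
    print(is_valid_utf8(b"\xc0\x80"))           # overlong form, rejected
```

Note that the absence of these bytes does not prove the data is valid UTF-8 (a lone continuation byte such as 0x80 is also malformed), which is why the sketch keeps the quick scan and the full decode as two separate functions.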
The Unicode Standard supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X 0221, or JIS X 0213, for example, and many more. This is true no matter which encoding form of Unicode is used: UTF-8, UTF-16, or UTF-32.
UTF-8 can store the full Unicode range, so it's fine to use for Arabic.
All three are just different ways to represent the same thing, so there are no languages supported by one and not another.
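To make that concrete, here is a small Python sketch that encodes the same text in all three forms and decodes it back. The sample string is an assumption chosen for illustration (one Coptic, one Sinhala, one Cherokee and one Arabic letter, plus a musical symbol); any other text should round-trip the same way.

```python
# One code point each: Coptic, Sinhala, Cherokee, Arabic, musical G clef.
SAMPLE = "\u2c81\u0d85\u13a0\u0639\U0001d11e"

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def round_trip_sizes(text):
    """Encode text in each form, check it decodes back, return byte sizes."""
    sizes = {}
    for enc in ENCODINGS:
        data = text.encode(enc)
        if data.decode(enc) != text:
            raise ValueError("round trip failed for " + enc)
        sizes[enc] = len(data)
    return sizes


if __name__ == "__main__":
    for enc, size in round_trip_sizes(SAMPLE).items():
        print(enc, size, "bytes")
```

Only the byte counts differ between the three; the decoded text is identical in every case.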
Sometimes UTF-16 is used by a system that you need to interoperate with - for instance, the Windows API uses UTF-16 natively.
In theory, UTF-32 can represent any code point in a single 32-bit integer, whereas UTF-8 and UTF-16 may need more than one 8-bit or 16-bit integer to do that. But in practice a single user-perceived "character" can be made of several code points (a base letter plus combining marks, for example), and some characters exist both as one precomposed code point and as a combining sequence, so even UTF-32 does not give you one integer per character.
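A short Python sketch of the combining issue, using the letter e with an acute accent as an assumed example. The helper name utf32_units is invented here. Both strings display as the same character, but one takes two UTF-32 code units.

```python
import unicodedata

PRECOMPOSED = "\u00e9"    # e with acute accent as a single code point
DECOMPOSED = "e\u0301"    # plain e followed by a combining acute accent


def utf32_units(text):
    """Number of 32-bit code units needed to store text in UTF-32."""
    return len(text.encode("utf-32-le")) // 4


if __name__ == "__main__":
    print(utf32_units(PRECOMPOSED), utf32_units(DECOMPOSED))
    # The two spellings compare unequal until they are normalized.
    print(PRECOMPOSED == DECOMPOSED)
    print(unicodedata.normalize("NFC", DECOMPOSED) == PRECOMPOSED)
```

Normalizing to NFC happens to collapse this particular pair, but not every combining sequence has a precomposed form, so normalization does not restore a one-integer-per-character guarantee.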
One advantage of UTF-8 over the others is that if you have a bug whereby you're assuming that the number of 8-, 16- or 32-bit integers respectively is the same as the number of codepoints, it becomes obvious more quickly with UTF-8 - something will fail as soon as you have any non-ASCII codepoint in there, whereas with UTF-16 the bug can go unnoticed until a character outside the Basic Multilingual Plane (one that needs a surrogate pair) shows up.
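The following Python sketch shows when the "one code unit per codepoint" assumption breaks in each encoding. The helper code_units and the three sample strings are assumptions made up for this illustration.

```python
UNIT_WIDTH = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}  # bytes per code unit


def code_units(text, encoding):
    """Count the code units text occupies in the given encoding."""
    return len(text.encode(encoding)) // UNIT_WIDTH[encoding]


SAMPLES = {
    "ascii": "cafe",             # 4 code points, all ASCII
    "latin": "caf\u00e9",        # 4 code points, one non-ASCII
    "astral": "caf\U0001f600",   # 4 code points, one outside the BMP
}

if __name__ == "__main__":
    for name, text in SAMPLES.items():
        counts = {enc: code_units(text, enc) for enc in UNIT_WIDTH}
        # len(text) is the number of code points in Python 3.
        print(name, len(text), counts)
```

Each sample has 4 codepoints. The UTF-8 count stops matching as soon as the accented letter appears, while the UTF-16 count still matches for that sample and only diverges on the last one, which is why the bug can survive much longer in UTF-16 code.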
To answer your first question, here's a list of scripts currently unsupported by Unicode: http://www.unicode.org/standard/unsupported.html