Which Languages Does UTF-8 Not Support?

I'm working on internationalizing one of my programs for work. I'm trying to use foresight to avoid possible issues or redoing the process down the road.

I see references to UTF-8, UTF-16 and UTF-32. My question has two parts:

  1. What languages does UTF-8 not support?
  2. What advantages do UTF-16 and UTF-32 have over UTF-8?

If UTF-8 works for everything, then I'm curious what the advantages of UTF-16 and UTF-32 are (e.g. special search features in a database, etc.). Understanding this should help me finish designing my program (and database connections) properly. Thanks!

James Oravec asked Mar 27 '13 16:03



People also ask

Does UTF-8 support all languages?

UTF-8 supports any Unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phoenician, Cherokee, etc.), as well as many non-spoken notations (music notation, mathematical symbols, APL). The stated objective of the Unicode Consortium is to encompass all communications.

What characters are not allowed in UTF-8?

0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF are invalid UTF-8 code units. A UTF-8 code unit is 8 bits. If by char you mean an 8-bit byte, then the invalid UTF-8 code units would be char values that do not appear in UTF-8 encoded text.
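The byte list above can be checked empirically. The following is a minimal sketch, assuming Python 3 and its built-in `utf-8` codec; the names `all_scalars`, `never_used` and `decodes` are made up for this illustration. It encodes every code point that UTF-8 is allowed to encode and collects the byte values that never appear.

```python
# Encode every Unicode scalar value (all code points except the surrogate
# range U+D800..U+DFFF, which UTF-8 does not encode) and record which
# byte values ever show up in the result.
all_scalars = "".join(
    chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF
)
used = set(all_scalars.encode("utf-8"))
never_used = sorted(set(range(256)) - used)
print([f"0x{b:02X}" for b in never_used])


def decodes(data: bytes) -> bool:
    """Return True if `data` is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A byte from the never-used set makes the whole sequence undecodable.
print(decodes(b"\xc0\x80"))            # rejected by the strict decoder
print(decodes("é".encode("utf-8")))    # ordinary two-byte sequence
```

Note that this only shows which single byte values never occur. A sequence built entirely from "allowed" bytes can still be invalid UTF-8, for example a lone continuation byte.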

Is Japanese supported in UTF-8?

The Unicode Standard supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X 0221, or JIS X 0213, for example, and many more. This is true no matter which encoding form of Unicode is used: UTF-8, UTF-16, or UTF-32.

Does UTF-8 support Arabic?

UTF-8 can store the full Unicode range, so it's fine to use for Arabic.


1 Answer

All three are just different ways to represent the same thing, so there are no languages supported by one and not another.
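As a rough illustration of "different ways to represent the same thing", here is a small Python 3 sketch. The sample string and variable names are arbitrary choices for the example. It encodes one string with each of the three encodings, decodes it back, and compares the sizes.

```python
# One string mixing Latin, Cyrillic, Japanese and an emoji.
text = "naïve café, Привет, こんにちは, 😀"

# The "-le" variants fix the byte order and leave out the byte order mark,
# so the byte counts below reflect only the characters themselves.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

encoded = {name: text.encode(name) for name in encodings}
round_trips = {name: data.decode(name) == text for name, data in encoded.items()}
byte_lengths = {name: len(data) for name, data in encoded.items()}

for name in encodings:
    print(f"{name:10} {byte_lengths[name]:4} bytes  round-trips: {round_trips[name]}")
```

Every encoding gets the original string back unchanged; only the number of bytes differs. Which one is smallest depends on the text: UTF-8 tends to win for mostly-ASCII content, while UTF-16 can be more compact for text dominated by CJK characters.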

Sometimes UTF-16 is used by a system that you need to interoperate with - for instance, the Windows API uses UTF-16 natively.

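In theory, UTF-32 can represent any "character" in a single 32-bit integer without ever needing to use more than one, whereas UTF-8 and UTF-16 sometimes need more than one 8-bit or 16-bit integer to do that. But in practice that's not really true: UTF-32 gives you one integer per codepoint, and what a user sees as one character can be made of several codepoints (a base letter plus combining marks, for instance), with some characters existing in both a precomposed and a combining form.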
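The combining-character caveat can be seen with a single accented letter. This is a sketch in Python 3 using the standard `unicodedata` module; the helper `utf32_units` is a made-up name for the example.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def utf32_units(s: str) -> int:
    """Number of 32-bit integers needed to store `s` in UTF-32 (no BOM)."""
    return len(s.encode("utf-32-le")) // 4


# Both strings render as the same visible character, yet even in UTF-32
# they occupy a different number of integers.
print(utf32_units(precomposed), utf32_units(decomposed))

# The two forms compare unequal until they are normalized to the same form.
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

So if the program compares, searches, or truncates user-visible text, normalization (and possibly grapheme-aware handling) matters regardless of which of the three encodings is chosen.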

One advantage of UTF-8 over the others is that if you have a bug whereby you're assuming that the number of 8-, 16- or 32-bit integers respectively is the same as the number of codepoints, it becomes obvious more quickly with UTF-8: something will fail as soon as you have any non-ASCII codepoint in there, whereas with UTF-16 the bug can go unnoticed until a codepoint outside the Basic Multilingual Plane (an emoji, say) shows up.
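To make that concrete, here is a small Python 3 sketch (the sample strings and the `code_units` helper are invented for the example) that compares the number of codepoints with the number of 8-, 16- and 32-bit units for three inputs: pure ASCII, one accented letter, and one emoji outside the Basic Multilingual Plane.

```python
def code_units(s: str, encoding: str, unit_size: int) -> int:
    """Count the fixed-size integers (code units) used to encode `s`."""
    return len(s.encode(encoding)) // unit_size


samples = {
    "ascii": "cafe",           # ASCII only
    "bmp": "caf\u00e9",        # ends in "é" (U+00E9), inside the BMP
    "astral": "caf\U0001F600", # ends in an emoji (U+1F600), outside the BMP
}

counts = {}
for label, s in samples.items():
    counts[label] = {
        "codepoints": len(s),
        "utf8": code_units(s, "utf-8", 1),
        "utf16": code_units(s, "utf-16-le", 2),
        "utf32": code_units(s, "utf-32-le", 4),
    }
    print(label, counts[label])
```

The UTF-8 count already diverges from the codepoint count for the accented letter, while the UTF-16 count stays equal until the emoji appears. That is why a "units equal characters" assumption tends to surface in early testing with UTF-8 but can survive much longer with UTF-16.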

To answer your first question: since UTF-8 can encode everything in Unicode, the only scripts it doesn't support are the ones Unicode itself doesn't support yet. Here's a list of those: http://www.unicode.org/standard/unsupported.html

RichieHindle answered Oct 15 '22 17:10
