
Is UTF-8 enough for all common languages?

I want to build a translation app in a Django project that lets registered users with certain permissions translate every message that appears in the latest version.

My question is: which character set should I use for the database tables in this translation app? It looks like some European language characters cannot be stored in UTF-8?

jack asked Mar 13 '10 15:03


People also ask

Can UTF-8 handle all languages?

UTF-8 supports any Unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phoenician, Cherokee, etc.), as well as many non-spoken notations (music notation, mathematical symbols, APL). The stated objective of the Unicode Consortium is to encompass all communications.

Does Unicode cover all languages?

The simplest answer is that Unicode covers all of the languages that can be written in the following widely-used scripts: Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, ...

Can UTF-8 represent all characters?

Each UTF can represent any Unicode character that you need to represent. UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.
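To make the 1-to-4-byte claim concrete, here is a small Python sketch. The sample characters and the helper name `utf8_length` are chosen purely for illustration; only the built-in `str.encode` is used.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


# One sample per encoded length; written as escapes so the source file's
# own encoding cannot affect the result.
samples = {
    "A": "U+0041 LATIN CAPITAL LETTER A",
    "\u00e9": "U+00E9 LATIN SMALL LETTER E WITH ACUTE",
    "\u20ac": "U+20AC EURO SIGN",
    "\U0001F600": "U+1F600 GRINNING FACE",
}

for ch, description in samples.items():
    print(f"{description}: {utf8_length(ch)} byte(s)")
```

The four samples need 1, 2, 3 and 4 bytes respectively, and every code point below U+0080 (the ASCII range) encodes as a single byte.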


2 Answers

Looks like some european language characters cannot be stored in UTF-8?

Not true. UTF-8 can encode every Unicode character without limitations, and Unicode covers practically every script in real-world use, except maybe Klingon. UTF-8 is your one-stop shop for internationalization. If you have problems with characters, they are most likely encoding problems, or missing support for that character range in the font you're using to display the data (extremely unlikely for a European language character, but common, for example, when viewing Indian sites on a European computer. See also this question).

If a non-Western script can't be rendered, it could be that the user's built-in font does not cover that range of Unicode.

Update: Klingon is indeed not part of official Unicode:

Some modern invented scripts which have not yet been included in Unicode (e.g., Tengwar) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., Klingon) are listed in the ConScript Unicode Registry, along with unofficial but widely-used Private Use Area code assignments.

However, there is a volunteer project that has unofficially assigned code points U+F8D0 to U+F8FF in the Private Use Area to Klingon. Gallery of Klingon characters
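As a rough illustration, the Python sketch below shows that UTF-8 itself has no trouble with that range: Private Use Area code points encode like any other. What they lack is a standard meaning, so whether a Klingon glyph appears depends entirely on the font. The variable names are illustrative only.

```python
import unicodedata

# The range the volunteer registry uses for Klingon (unofficial assignment).
klingon_range = range(0xF8D0, 0xF8FF + 1)

first = chr(klingon_range[0])

# Unicode classifies these code points as "Co" (private use): valid, but
# with no officially assigned character.
print(unicodedata.category(first))

# UTF-8 encodes them as ordinary 3-byte sequences and round-trips them.
encoded = first.encode("utf-8")
print(encoded)
print(encoded.decode("utf-8") == first)
```

So storing such text in a UTF-8 database column works; rendering it needs a font that follows the same private convention.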

Pekka answered Sep 29 '22 06:09


UTF-8 can be used to represent all of Unicode, so it doesn't merely let you express all common languages. It allows you to express all languages.

If it seems as if some European characters aren't working, that's an encoding issue.
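A typical encoding issue of this kind can be reproduced in a few lines of Python. This is a minimal sketch that assumes the bytes were written as UTF-8 but read back as Latin-1, one common mismatch among several possible ones; the sample word is arbitrary.

```python
text = "Z\u00fcrich"  # "Zürich", written with an escape for portability

stored = text.encode("utf-8")  # what a UTF-8 table or file actually holds

# Reading the same bytes with the wrong codec splits the two-byte "ü"
# into two unrelated Latin-1 characters (mojibake).
misread = stored.decode("latin-1")

# Reading with the right codec returns the original text unchanged.
correct = stored.decode("utf-8")

print(repr(misread))
print(repr(correct))

# In this particular case the damage is reversible by undoing the wrong step.
repaired = misread.encode("latin-1").decode("utf-8")
```

The data was stored correctly the whole time; only the decoding step was wrong. That is why checking the connection, file, and template encodings is usually more productive than changing the table's character set.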

Williham Totland answered Sep 29 '22 07:09