
UTF-8 or UTF-16 or UTF-32 or UCS-2

I am designing a new CMS and want it to fit all my future needs, such as multilingual content, so I was thinking Unicode (UTF-8) is the best solution.

But after some searching I found this article:

http://msdn.microsoft.com/en-us/library/bb330962%28SQL.90%29.aspx#intlftrql2005_topic2

So now I am confused about which to use: UTF-8, UTF-16, UTF-32, or UCS-2.

Which is better for multilingual content, performance, etc.?

PS: I am using ASP.NET, C#, and SQL Server 2005.

Thanks in advance

Pola Edward asked Aug 13 '10 01:08



2 Answers

Quick note: basically everything can be represented in the Unicode character set. UTF-8 is just one of the encodings able to represent every character in that set.

UCS-2 is not really a thing to use anymore. It is a fixed-width 16-bit encoding, so it can't hold characters beyond U+FFFF.
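To make the U+FFFF limit concrete, here is a small Python sketch (not part of the original answer; the helper name `utf16_code_units` is made up for illustration). It shows that a character inside the 16-bit range takes one UTF-16 code unit, while a character above U+FFFF needs two, which is exactly what a fixed 16-bit scheme like UCS-2 cannot express.

```python
def utf16_code_units(ch):
    """Return the 16-bit code units used to encode ch in UTF-16 (big-endian, no BOM)."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp = "\u20ac"         # EURO SIGN, U+20AC: fits in a single 16-bit unit
astral = "\U0001F600"  # GRINNING FACE, U+1F600: lies beyond U+FFFF

print([hex(u) for u in utf16_code_units(bmp)])     # one code unit
print([hex(u) for u in utf16_code_units(astral)])  # two code units (a surrogate pair)
```

Code that assumes one 16-bit unit per character would see the second character as two separate values.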

Which of the remaining three to choose depends on what kind of operations you want to do on the text. UTF-8 will usually (not always!) take up less space on disk for the same data, and it is a strict superset of ASCII, so it might reduce the amount of transcoding needed. However, because it is variable-width, you can't index a string by code point or find its length in code points in constant time.

UTF-32 does let you find the length of a string and index it in constant time, because every code point takes exactly 4 bytes. It isn't a superset of ASCII like UTF-8 is, and those 4 bytes per code point cost space, but hey, disk space is cheap.
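The indexing difference can be sketched in Python as follows. This is only an illustration working on raw encoded bytes; the function names are hypothetical, and "length" here means code points, not user-perceived characters. With UTF-32 the position of code point `i` is simple arithmetic, while with UTF-8 the bytes have to be scanned from the start.

```python
def utf32_length(data):
    """Number of code points in UTF-32 data: just the byte length divided by 4."""
    return len(data) // 4

def utf32_index(data, i):
    """Constant-time lookup: code point i occupies bytes [4*i, 4*i + 4)."""
    return chr(int.from_bytes(data[4 * i:4 * i + 4], "little"))

def utf8_index(data, i):
    """Linear-time lookup: walk the bytes, counting lead bytes until code point i."""
    seen = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a new code point starts here
            seen += 1
            if seen == i:
                start = pos
            elif seen == i + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(i)
    return data[start:].decode("utf-8")

text = "na\u00efve caf\u00e9"  # 10 code points, two of them non-ASCII
print(utf32_length(text.encode("utf-32-le")))
print(utf32_index(text.encode("utf-32-le"), 2))
print(utf8_index(text.encode("utf-8"), 2))
```

Both lookups return the same character; only the amount of work needed to find it differs.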

habnabit answered Sep 22 '22 18:09


UTF-8 and UTF-16 are both good choices. Both give you access to the full range of Unicode code points without using up 4 bytes for every character.

Your choice will be influenced by the language you're using and its support for these formats. I believe UTF-8 plays best with ASP.NET overall, but it will depend on what you're doing.

UTF-8 is often a good choice overall because it plays well with code that expects only ASCII, whereas UTF-16 doesn't. It is also the most efficient way of representing content that consists largely of the English alphabet, while still allowing the full repertoire of Unicode when needed. A good reason for choosing UTF-16 would be if your language/framework uses it natively, or if you're going to be mainly using characters that aren't in ASCII, such as those of many Asian languages, which typically take 3 bytes each in UTF-8 but 2 in UTF-16.
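A rough way to check both points for your own content is to encode a sample and compare. The Python sketch below is illustrative only: the sample strings and the `byte_lengths` helper are assumptions, and real content will mix scripts and markup.

```python
def byte_lengths(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

english = "Hello, world"
japanese = "\u3053\u3093\u306b\u3061\u306f"  # five hiragana characters

# ASCII-only text produces identical bytes in ASCII and UTF-8 ...
print(english.encode("utf-8") == english.encode("ascii"))
# ... but not in UTF-16, which uses two bytes even for ASCII characters.
print(english.encode("utf-16-le") == english.encode("ascii"))

for label, sample in (("english", english), ("japanese", japanese)):
    print(label, byte_lengths(sample))
```

For the English sample UTF-8 is half the size of UTF-16, while for the Japanese sample UTF-16 is the smaller of the two.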

thomasrutter answered Sep 22 '22 18:09