I am designing a new CMS and want it to cover my future needs, such as multilingual content, so I was thinking Unicode (UTF-8) is the best solution.
But after some searching I found this article:
http://msdn.microsoft.com/en-us/library/bb330962%28SQL.90%29.aspx#intlftrql2005_topic2
So now I am confused about which to use: UTF-8, UTF-16, UTF-32, or UCS-2.
Which is better for multilingual content, performance, etc.?
PS: I am using ASP.NET, C#, and SQL Server 2005.
Thanks in advance.
The UCS-2 standard, an early version of Unicode, is limited to 65,536 code points (U+0000 to U+FFFF). However, the data processing industry needs over 94,000 characters; the UCS-2 standard has been superseded by the Unicode UTF-16 standard.
Efficiency. UTF-8 requires 8, 16, 24 or 32 bits (one to four bytes) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character.
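To make those byte counts concrete, here is a small Python sketch (Python is used only for illustration; the function name encoded_sizes is made up for this example). It encodes a single character with each encoding and reports how many bytes come out. The "-le" codec variants are used so that no byte order mark is added to the count.

```python
# Hypothetical helper: byte count of one character in each encoding.
# The "-le" variants are chosen so Python does not prepend a BOM.
def encoded_sizes(ch):
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


if __name__ == "__main__":
    samples = [
        "A",           # U+0041, ASCII letter
        "\u00e9",      # U+00E9, e with acute accent
        "\u20ac",      # U+20AC, euro sign
        "\U0001F600",  # U+1F600, an emoji outside the 16-bit range
    ]
    for ch in samples:
        print("U+%04X" % ord(ch), encoded_sizes(ch))
```

For these four samples UTF-8 grows from 1 to 4 bytes, UTF-16 uses 2 bytes until the character falls outside the 16-bit range and then needs 4, and UTF-32 is always 4.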
UCS-2 is a character encoding standard in which characters are represented by a fixed-length 16 bits (2 bytes). It is used as a fallback on many GSM networks when a message cannot be encoded using GSM-7 or when a language requires more than 128 characters to be rendered.
There is a simple rule of thumb for which Unicode Transformation Format (UTF) to use:
- UTF-8 for storage and communication
- UTF-16 for data processing
- UTF-32 if most of the platform APIs you use are UTF-32 (common in the UNIX world)
Quick note: basically everything can be represented in the Unicode character set. UTF-8 is just one of the encodings that can represent every character in this set.
UCS-2 is not really a thing to use anymore. It can't hold characters beyond U+FFFF.
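As a rough illustration of that limit, the Python sketch below (the helper names fits_in_ucs2 and utf16_code_units are invented for this example) checks whether a string stays within U+FFFF and shows how UTF-16 handles a character that does not, by splitting it into two 16-bit code units (a surrogate pair).

```python
import struct


# Hypothetical helper: True if every code point fits in one 16-bit unit,
# which is all that UCS-2 can represent.
def fits_in_ucs2(text):
    return all(ord(ch) <= 0xFFFF for ch in text)


# Hypothetical helper: list the 16-bit code units of the UTF-16 encoding.
def utf16_code_units(text):
    data = text.encode("utf-16-le")  # little-endian, no BOM
    return [unit for (unit,) in struct.iter_unpack("<H", data)]


if __name__ == "__main__":
    euro = "\u20ac"       # U+20AC, inside the 16-bit range
    emoji = "\U0001F600"  # U+1F600, outside it
    print(fits_in_ucs2(euro), [hex(u) for u in utf16_code_units(euro)])
    print(fits_in_ucs2(emoji), [hex(u) for u in utf16_code_units(emoji)])
```

The euro sign is a single code unit, while the emoji becomes the pair 0xD83D, 0xDE00, something UCS-2 has no way to express.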
Which of the remaining three to pick depends on what kind of operations you want to do on the text. UTF-8 (usually, not always!) will take up less space on disk representing the same data, and is a strict superset of ASCII, so it might reduce the amount of transcoding needed. However, because its characters vary in width, you can't index your string by code point or find its length in code points in constant time.
UTF-32 does allow you to find the length of the string and index it by code point in constant time, since every code point takes exactly 4 bytes. It isn't a superset of ASCII like UTF-8 is, and it costs you 4 bytes per code point, but hey, disk space is cheap.
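The indexing difference can be sketched in Python as follows. Both function names are hypothetical, and both assume a non-negative index and well-formed input. With UTF-32 the byte offset of the n-th code point is simply 4 * n; with UTF-8 you have to walk the bytes and count the ones that start a character.

```python
# Hypothetical helper: fixed-width lookup, offset is just index * 4.
def nth_code_point_utf32(data, index):
    start = index * 4
    if start + 4 > len(data):
        raise IndexError("code point index out of range")
    return data[start:start + 4].decode("utf-32-le")


# Hypothetical helper: variable-width lookup, needs a linear scan.
def nth_code_point_utf8(data, index):
    count = -1    # number of characters started so far, minus one
    start = None  # byte offset where the wanted character begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte (10xxxxxx)
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


if __name__ == "__main__":
    text = "a\u00e9\u20ac\U0001F600z"
    as_utf8 = text.encode("utf-8")
    as_utf32 = text.encode("utf-32-le")
    for i in range(len(text)):
        assert nth_code_point_utf8(as_utf8, i) == nth_code_point_utf32(as_utf32, i)
    print(len(as_utf8), "bytes as UTF-8,", len(as_utf32), "bytes as UTF-32")
```

In practice most libraries decode into their own in-memory string type first, so this matters mainly when you work on the encoded bytes directly.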
UTF-8 or UTF-16 are both good choices. They both give you access to the full range of Unicode code points without using up 4 bytes for every character.
Your choice will be influenced by the language you're using and its support for these formats. I believe UTF-8 plays best with ASP.NET overall, but it will depend on what you're doing.
UTF-8 is often a good choice overall because it plays well with code that expects only ASCII, whereas UTF-16 doesn't. It is also the most efficient way of representing content that consists largely of the English alphabet, while still allowing the full repertoire of Unicode when needed. A good reason for choosing UTF-16 would be if your language/framework uses it natively, or if you're going to be mainly using characters that aren't in ASCII, as in many Asian languages.
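A quick way to see this trade-off is to encode sample text and compare sizes. The Python sketch below is only an illustration (size_report and the two sample strings are made up for it); real content mixes scripts, markup, and whitespace, so measure your own data before deciding.

```python
# Hypothetical helper: encoded size in bytes for each encoding (no BOM).
def size_report(text):
    codecs = {"utf-8": "utf-8", "utf-16": "utf-16-le", "utf-32": "utf-32-le"}
    return {name: len(text.encode(codec)) for name, codec in codecs.items()}


if __name__ == "__main__":
    english = "Hello, world"   # 12 ASCII characters
    japanese = "こんにちは世界"  # 7 characters, all in the 16-bit range
    print("english ", size_report(english))
    print("japanese", size_report(japanese))
```

The English sample is smallest in UTF-8 (12 bytes against 24 in UTF-16), while the Japanese sample is smallest in UTF-16 (14 bytes against 21 in UTF-8).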