Given that Unicode has been around for 18 years, why are there still apps that don't have Unicode support? Even my experiences with some operating systems and Unicode have been painful to say the least. As Joel Spolsky pointed out in 2003, it's not that hard. So what's the deal? Why can't we get it together?
It became apparent that a new character encoding scheme was needed, which is when the Unicode standard was created. The objective of Unicode is to unify all the different encoding schemes so that the confusion between computers can be limited as much as possible.
Because its first 128 code points (0 through 127) are the traditional ASCII characters, Unicode assigns each of these characters its original ASCII value. Since Unicode is quickly becoming the universal character set of the web, all current Web standards rely on it.
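A minimal Python 3 sketch of that ASCII compatibility. The helper name `ascii_compatible` is made up for illustration. It only checks that UTF-8 and ASCII yield the same bytes for a given string:

```python
# Every ASCII character keeps its original value as a Unicode code point,
# and UTF-8 encodes those code points as the same single byte.

def ascii_compatible(text: str) -> bool:
    """Return True if UTF-8 and ASCII produce identical bytes for `text`."""
    try:
        return text.encode("utf-8") == text.encode("ascii")
    except UnicodeEncodeError:
        # The text contains at least one character outside ASCII.
        return False


print(ord("A"))                    # 65, same as the ASCII value
print("A".encode("utf-8"))         # b'A', a single byte
print(ascii_compatible("Hello"))   # True
print(ascii_compatible("Ő"))       # False
```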
Unicode is inconsistent with regard to which symbols get unique codes and which do not. All of the accented letters of the European languages have their own code (Ő is 0150), but Native American symbols, like Guaraní g̃ or Dene Ų̀, have to be made up from more than one code: g̃ is 0067 (g) followed by 0303 (combining ~).
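This difference can be observed with Python's standard `unicodedata` module. The sketch below (the helper `code_points` is just an illustrative name) shows that Ő decomposes into two codes on request, while g̃ cannot be composed into one because no single code exists for it:

```python
import unicodedata

o_double_acute = "\u0150"      # Ő as one precomposed code point
g_tilde = "\u0067\u0303"       # g̃ as g followed by COMBINING TILDE


def code_points(text: str) -> list:
    """List the code points of `text` as four-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in text]


# NFC tries to compose; g̃ stays as two code points.
nfc_g = unicodedata.normalize("NFC", g_tilde)
# NFD decomposes; Ő becomes O plus COMBINING DOUBLE ACUTE ACCENT.
nfd_o = unicodedata.normalize("NFD", o_double_acute)

print(code_points(nfc_g))   # ['0067', '0303']
print(code_points(nfd_o))   # ['004F', '030B']
```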
Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the ISO/IEC 8859 standard, which find wide usage in various countries of the world but remain largely incompatible with each other.
How often...
Do you know the difference between a collation and an encoding?
Where did you first hear of Unicode?
Have you ever, in your young days, experienced moving source files from a system in locale A to a system in locale B, edited a typo on system B, saved the files, b0rking all the non-ascii comments and... ending up wasting a lot of time trying to understand what happened? (did your editor mix things up? the compiler? the system? the... ?)
Did you end up deciding that you would never again comment your code using non-ASCII characters?
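For anyone who has not lived through the broken-comments scenario, here is a rough Python 3 reconstruction. It assumes system A saves files as UTF-8 and system B's editor assumes Latin-1; the real culprit on any given machine may differ:

```python
original = "# café: résumé du module"

# System A writes the file as UTF-8.
stored = original.encode("utf-8")

# System B's editor reads those bytes as Latin-1 (assumed here).
misread = stored.decode("latin-1")
print(misread)   # '# cafÃ©: rÃ©sumÃ© du module'

# After fixing a typo, B's editor saves what it sees, this time as UTF-8.
resaved = misread.encode("utf-8")

# Back on system A the comments are still garbled.
back_on_a = resaved.decode("utf-8")

# If you can guess the wrong step, it can be undone.
repaired = back_on_a.encode("latin-1").decode("utf-8")
```

The repair on the last line works only because Latin-1 maps every byte to exactly one character, so the wrong decode lost no information.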
Python
Did I mention on SO that I love Python? No? Well I love Python.
But until Python 3.0, its Unicode support sucked. And there were all those rookie programmers, who at that time barely knew how to write a loop, getting UnicodeDecodeError and UnicodeEncodeError out of nowhere when trying to deal with non-ASCII characters. Well, they basically got traumatized for life by the Unicode monster, and I know a lot of very efficient/experienced Python coders who are still frightened today by the idea of having to deal with Unicode data.
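Both errors are easy to reproduce. The sketch below is Python 3, where the conversion has to be requested explicitly; in Python 2 the same failures were typically triggered by implicit ASCII conversions when mixing `str` and `unicode`. The helper name `outcome` is made up:

```python
def outcome(action):
    """Run `action` and return its result, or the name of the Unicode error."""
    try:
        return action()
    except (UnicodeDecodeError, UnicodeEncodeError) as exc:
        return type(exc).__name__


utf8_bytes = "é".encode("utf-8")   # two bytes: 0xC3 0xA9

# Bytes above 0x7F are not valid ASCII, so decoding as ASCII fails.
decode_result = outcome(lambda: utf8_bytes.decode("ascii"))

# "é" has no ASCII representation, so encoding as ASCII fails.
encode_result = outcome(lambda: "é".encode("ascii"))

print(decode_result)   # UnicodeDecodeError
print(encode_result)   # UnicodeEncodeError
```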
And with Python3, there is a clear separation between Unicode & bytestrings, but... look at how much trouble it is to port an application from Python 2.x to Python 3.x if you previously did not care much about the separation/if you don't really understand what Unicode is.
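A small sketch of what that separation means in practice, assuming Python 3: text (`str`) and encoded data (`bytes`) no longer mix silently, so code that used to rely on implicit conversion has to name an encoding.

```python
text = "naïve"                  # str: a sequence of Unicode code points
data = text.encode("utf-8")     # bytes: one particular encoding of that text

try:
    combined = text + data      # mixing the two types is refused
except TypeError:
    combined = None

# The conversion has to be spelled out, including the encoding.
explicit = text + data.decode("utf-8")

print(combined)   # None
print(explicit)   # naïvenaïve
```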
Databases, PHP
Do you know a popular commercial website that stores its international text as Unicode?
You will (perhaps) be surprised to learn that the Wikipedia backend does not store its data in Unicode-aware text columns. All text is encoded in UTF-8 and is stored as binary data in the database.
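To make "UTF-8 stored as binary data" concrete, here is a sketch using Python's built-in `sqlite3` module as a stand-in database. The table and column names are hypothetical and this is not a description of Wikipedia's actual schema:

```python
import sqlite3

title = "Ősz"   # three characters

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (title BLOB)")

# The application encodes the text itself; the database only sees bytes.
conn.execute("INSERT INTO page (title) VALUES (?)", (title.encode("utf-8"),))

stored, stored_length = conn.execute(
    "SELECT title, length(title) FROM page"
).fetchone()
conn.close()

print(type(stored).__name__)    # bytes
print(stored_length)            # 4: the database counts bytes, not characters
print(stored.decode("utf-8"))   # Ősz
```

The database cannot count characters or sort the column linguistically, since all it holds is an opaque byte string.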
One key issue here is how to sort text data if you store it as Unicode codepoints. This is where Unicode collations come in: they define a sorting order on Unicode codepoints. But proper support for collations in databases is missing or still in active development. (There are probably a lot of performance issues, too. -- IANADBA) Also, there is no widely accepted standard for collations yet: for some languages, people don't agree on how words/letters/wordgroups should be sorted.
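To see why sorting by raw codepoint is not enough, compare it with a crude accent- and case-insensitive key. The `rough_key` function below is a made-up approximation for illustration only; it is not a real Unicode collation and ignores language-specific rules:

```python
import unicodedata

words = ["zebra", "Zulu", "apple", "école", "Eagle"]


def rough_key(word: str) -> str:
    """Strip combining marks and fold case (a stand-in for a collation key)."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


by_code_point = sorted(words)
by_rough_key = sorted(words, key=rough_key)

print(by_code_point)   # ['Eagle', 'Zulu', 'apple', 'zebra', 'école']
print(by_rough_key)    # ['apple', 'Eagle', 'école', 'zebra', 'Zulu']
```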
Have you heard of Unicode normalization? (Basically, you should convert your Unicode data to a canonical representation before storing it.) Of course it's critical for database storage and for local comparisons. But PHP, for example, has only provided support for normalization since 5.2.4, which came out in August 2007.
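A short sketch of why normalization matters for comparisons, written in Python rather than PHP and using the standard `unicodedata` module. The choice of NFC as the canonical form and the helper name `same_text` are assumptions made for the example:

```python
import unicodedata

composed = "\u00e9"       # é as a single code point
decomposed = "e\u0301"    # é as e followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after converting both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)            # False: different code points
print(same_text(composed, decomposed))   # True: same text once normalized
```

Normalizing once, before storing, means lookups and uniqueness checks in the database do not have to repeat this work on every comparison.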
And in fact, PHP does not completely support Unicode yet. We'll have to wait for PHP6 to get Unicode-compatible functions everywhere.
The Internet clearly helps spread the Unicode trend. And it's a good thing. Initiatives like Python3's breaking changes help educate people about the issue. But we will have to wait patiently a bit longer to see Unicode everywhere and new programmers instinctively using Unicode instead of Strings where it matters.
As an anecdote: because FedEx apparently does not support international addresses, the Google Summer of Code '09 students were all asked by Google to provide an ASCII-only name and address for shipping. If you think that most business actors understand the stakes behind Unicode support, you are just wrong. FedEx does not understand, and their clients do not really care. Yet.