
Is there a programming language with full and correct Unicode support?

Most programming languages have some support for Unicode, but all of them have more or less documented corner cases where things don't work correctly.


Examples

Java: reverse() in StringBuilder/StringBuffer works correctly, but length(), charAt(), etc. in String do not if a character needs more than 16 bits to encode.

C#: I didn't find a correct reverse method; Length and indexed access return wrong results for such characters.

Perl: Same problem.

PHP: Has no notion of Unicode at all; mbstring provides some better-working replacements.
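The length problems in the examples above come from counting UTF-16 code units instead of code points. Here is a minimal sketch of that difference, assuming Python 3.3 or later (where `len()` on a `str` counts code points). The helper names `utf16_code_units` and `code_points` are made up for illustration.

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store `text` as UTF-16.

    This is roughly what a length function reports when strings are
    stored as UTF-16 and measured in code units.
    """
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" emits no byte order mark.
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Count the Unicode code points in `text`."""
    return len(text)


# 'a' followed by U+1F600, a character outside the Basic Multilingual Plane.
sample = "a\U0001F600"

print(code_points(sample))       # 2
print(utf16_code_units(sample))  # 3 (U+1F600 needs a surrogate pair)
```

For text that stays inside the Basic Multilingual Plane the two counts agree, which is why the discrepancy is easy to overlook.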


I wonder whether there is a programming language that has full and correct Unicode support. What compromises had to be made to achieve it?

  • More complex algorithms?
  • Higher memory consumption?
  • Slower performance?

How was it implemented internally?

  • Array of Ints, Linked Lists, etc.
  • Additional buffering

I saw that Python 3 had some pretty big changes in this area. How close is Python 3 now to a correct implementation?

asked Jul 24 '10 13:07 by soc


2 Answers

The Java implementation is correct in the sense that it does not violate the Unicode standard; there is no prescription that string indexing work on code points instead of code units, and the behavior is documented. The Unicode standard gives implementors great freedom concerning optimizations as long as no invalid string is leaked.

“Full support” is even harder to define. The Unicode standard generally doesn’t require that certain features be implemented for a system to be Unicode-compatible; it only requires that the features that are implemented follow the standard. Large parts of script processing belong to fonts or the operating system, which programming systems cannot control.

If you want to judge the Unicode support of a particular technology, you can start by testing the following (subjective and non-exhaustive) list of topics:

  • Does the system have a string datatype that uses a Unicode encoding?
  • Are all Unicode encodings (UTFs) described in the standard supported?
  • Normalization
  • The Bidirectional Algorithm
  • Is UpperCase("ß") = "SS"?
  • Is upper-casing locale sensitive? (e.g. in Turkish, UpperCase("i") = "İ")
  • Are there functions to work with code points instead of code units?
  • Unicode regular expressions
  • Does the system raise exceptions when invalid code unit sequences are encountered during decoding?
  • Access to Unicode Database properties?

I think the Java and .NET answer to these questions is mostly “yes”, while the Python 3.x answer is almost always “no.”
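Some items on the list can be probed with a few lines of code. The sketch below does this for Python 3 using only the standard library, and assumes CPython 3.3 or later. The function name `probe_unicode_support` and the result keys are made up for illustration. It covers only a handful of items and says nothing about the others, such as the Bidirectional Algorithm or Unicode regular expressions.

```python
import unicodedata


def probe_unicode_support() -> dict:
    """Run a few of the checks from the list and report True/False per check."""
    results = {}

    # Full case mapping: upper-casing U+00DF (sharp s) should give "SS".
    results["upper_sharp_s"] = "\u00df".upper() == "SS"

    # Locale-sensitive casing: in Turkish, "i" upper-cases to U+0130.
    # str.upper() takes no locale argument, so this is expected to be False.
    results["upper_i_turkish"] = "i".upper() == "\u0130"

    # Normalization: NFC composes "e" + combining acute accent into U+00E9.
    results["normalization"] = (
        unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
    )

    # Unicode Character Database access: U+00DF is a lowercase letter ("Ll").
    results["ucd_properties"] = unicodedata.category("\u00df") == "Ll"

    # Strict decoding: 0xFF can never appear in well-formed UTF-8.
    try:
        b"\xff\xfe\xfd".decode("utf-8")
        results["rejects_invalid_utf8"] = False
    except UnicodeDecodeError:
        results["rejects_invalid_utf8"] = True

    return results


for check, passed in probe_unicode_support().items():
    print(f"{check}: {passed}")
```

The same probes can be written for other languages to compare them. Results can differ between interpreter versions.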

answered Oct 12 '22 23:10 by Philipp


Go, the new language developed at Google and designed by Ken Thompson and Rob Pike (among others), and the C dialect in Plan 9 from Bell Labs were both built with Unicode in mind. UTF-8 itself was invented at Bell Labs by Ken Thompson, together with Rob Pike.

answered Oct 13 '22 01:10 by Aram Hăvărneanu