Is there a programming language with full and correct Unicode support?

Question

Most programming languages have some support for Unicode, but all of them have more or less documented corner cases where things don't work correctly.

Examples

Java: reverse() in StringBuilder/StringBuffer works correctly, but length(), charAt(), etc. in String do not if a character needs more than 16 bits to encode.

C#: I didn't find a correct reverse method; Length and indexed access return wrong results.

Perl: Same problem.

PHP: Has no notion of Unicode at all; mbstring provides some better-working replacements.
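The Java and C# items above come down to the difference between UTF-16 code units and code points. The following is a minimal sketch of that difference, written in Python 3 only for brevity; the sample string and the helper `is_valid_utf16` are made up for illustration and are not taken from any of the languages discussed.

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane,
# so UTF-16 needs a surrogate pair (two 16-bit code units) to encode it.
text = "a\U0001D11Eb"

# Python 3 strings are indexed by code point.
code_points = len(text)

# Encode to UTF-16 (little endian, no BOM) and count 16-bit code units.
# This is the quantity that a UTF-16 based length() reports.
utf16 = text.encode("utf-16-le")
code_units = len(utf16) // 2

# Reversing code unit by code unit swaps the two halves of the surrogate pair.
units = [utf16[i:i + 2] for i in range(0, len(utf16), 2)]
naive_reversed = b"".join(reversed(units))


def is_valid_utf16(data: bytes) -> bool:
    """Hypothetical helper: True if data is well-formed UTF-16LE."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# Reversing by code point keeps the character intact.
reversed_by_code_point = text[::-1]
```

Here `code_points` is 3 while `code_units` is 4, and the code-unit-wise reversal is no longer well-formed UTF-16. Note that reversing by code point is still not a complete answer, since it can separate combining marks from their base characters.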

I wonder whether there is a programming language that has full and correct Unicode support. What compromises had to be made to achieve it?

More complex algorithms?
Higher memory consumption?
Slower performance?

How was it implemented internally?

Array of Ints, Linked Lists, etc.
Additional buffering
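To make the memory side of this trade-off concrete, here is a rough sketch that compares how many bytes the same text needs in the three Unicode encoding forms. It does not describe the internals of any particular language; the sample string and the function `encoded_sizes` are assumptions chosen for illustration.

```python
def encoded_sizes(s: str) -> dict:
    """Return the byte length of s in each encoding form (no BOM)."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return {codec: len(s.encode(codec)) for codec in codecs}


# Seven code points: five ASCII, one Latin-1 letter (U+00EF),
# and one character outside the Basic Multilingual Plane (U+1D11E).
sample = "na\u00EFve \U0001D11E"

sizes = encoded_sizes(sample)
```

For this sample the result is 11 bytes in UTF-8, 16 in UTF-16 and 28 in UTF-32. An "array of ints" (UTF-32) gives constant-time indexing by code point at the cost of the largest footprint, while the variable-width forms are smaller but need a scan to find the n-th code point.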

I saw that Python 3 had some pretty big changes in this area. How close is Python 3 now to a correct implementation?

Philipp · Accepted Answer

The Java implementation is correct in the sense that it does not violate the Unicode standard; there is no prescription that string indexing work on code points instead of code units, and the behavior is documented. The Unicode standard gives implementors great freedom concerning optimizations as long as no invalid string is leaked.

Concerning “full support”, that’s even harder to define. The Unicode standard generally doesn’t require that certain features be implemented to be Unicode-compatible; only that the features that are implemented are implemented according to the standard. Huge parts of script processing belong to fonts or the operating system, which programming systems cannot control.

If you want to judge the Unicode support of a certain technology, you can start by testing the following (subjective and non-exhaustive) list of topics:

Does the system have a string datatype that uses a Unicode encoding?
Are all Unicode (UTF) encodings supported that are described in the standard?
Normalization
The Bidirectional Algorithm
Is UpperCase("ß") = "SS"?
Is upper-casing locale sensitive? (e.g. in Turkish, UpperCase("i") = "İ")
Are there functions to work with code points instead of code units?
Unicode regular expressions
Does the system raise exceptions when invalid code unit sequences are encountered during decoding?
Access to Unicode Database properties?
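Several of these items can be probed with a few lines of code. The sketch below does so with Python 3's standard library, purely as an example of how such a test might look; the variable names and the helper `decodes_cleanly` are illustrative. It leaves out the bidirectional algorithm, which the standard library does not implement (`unicodedata.bidirectional` only returns a character's bidi class), and it cannot test locale-sensitive casing, because `str.upper()` takes no locale argument.

```python
import unicodedata

# Full case mapping: one code point may upper-case to two.
upper_eszett = "\u00DF".upper()  # U+00DF is LATIN SMALL LETTER SHARP S

# Casing is not locale sensitive here: the result is a plain "I",
# not the dotted capital I (U+0130) expected for Turkish.
upper_i = "i".upper()

# Normalization: "e" + COMBINING ACUTE ACCENT composes to U+00E9 under NFC.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)


def decodes_cleanly(data: bytes) -> bool:
    """Hypothetical helper: True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# Access to Unicode Character Database properties.
name = unicodedata.name("\u00DF")
category = unicodedata.category("\u00DF")
```

With this sketch, `upper_eszett` is "SS", `composed` is the single code point U+00E9, `decodes_cleanly(b"\xff")` is false, and the database lookup returns the name "LATIN SMALL LETTER SHARP S" with general category "Ll".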

I think the Java and .NET answer to these questions is mostly “yes”, while the Python 3.x answer is almost always “no.”

Aram Hăvărneanu · Answer

Go, the language developed at Google by Rob Pike, Ken Thompson and Robert Griesemer, and the C dialect of Plan 9 from Bell Labs were both built with Unicode in mind (UTF-8 itself was invented at Bell Labs by Ken Thompson, together with Rob Pike).

Tags:

language-agnostic

string

encoding

unicode

programming-languages

soc
