Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How well does your language support unicode in practice?

I'm looking into new languages, kind of craving for one where I no longer need to worry about charset problems amongst inordinate amounts of other niggles I have with PHP for a new project.

I tend to find Java too verbose and messy, and my not wanting to touch Windows with a 6-foot pole tends to rule out .Net. That leaves essentially everything else -- except PHP, C and C++ (the latter two of which I know get messy with unicode stuff irrespective of the ICU library).

I've short listed a few languages to date, namely Ruby (loved the mixins), Python, Lisp and Javascript (node.js). However, I'm coming with highly inconsistent information on unicode support and I'm dreading (lack of time...) to learn each and every one of them to the point where I can safely break it to rule it out.

In so far as I understood, Python 3 seems to have it. As does Ruby 1.9. Lisp not necessarily. Javascript presumably.

There's arguably more than unicode support to a language, but in my experience it tends to become a major drawback when dealing with locale.

I also realize the question is somewhat subjective. (Please don't close it on that grounds: I'm actually linking to several SO threads which I found unsatisfying.) But... as a user of any of these languages, how well do they support unicode in practice?

like image 841
Denis de Bernardy Avatar asked Jun 09 '11 00:06

Denis de Bernardy


People also ask

What language uses Unicode?

The simplest answer is that Unicode covers all of the languages that can be written in the following widely-used scripts: Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, ...

What is Unicode how is it useful?

Unicode is a universal character encoding standard. This standard includes roughly 100000 characters to represent characters of different languages. While ASCII uses only 1 byte the Unicode uses 4 bytes to represent characters. Hence, it provides a very wide variety of encoding.

Why Unicode is the most widely used?

It is designed for best interoperability with both ASCII and ISO-8859-1, the most widely used character sets, to make it easier for Unicode to be used in applications and protocols. The Unicode Standard is in use today, and it is the preferred character set for the Internet, especially for HTML and XML.

What languages does utf8 support?

UTF-8 supports any unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phonecian, Cherokee etc), as well as many non-spoken languages (Music notation, mathematical symbols, APL). The stated objective of the Unicode consortium is to encompass all communications.


4 Answers

Python's unicode support did not really change in 3.x. The unicode support in Python has been pretty much the same since Python 2.x, which introduced the separate unicode type and the encoding handling. What Python 3.x changes is that unicode becomes the only string type (and is renamed to str), whereas 2.x has bytestrings (str, "...") and unicode strings (unicode, u"...") that often but not always don't quite mix. (Allowing them to mix was an attempt to make transitioning from bytestrings to unicode easier, but it turned out a mistake.) All in all, Python's unicode support is quite good, mistakes in Python 2.x notwithstanding. There's unicode literals with numeric and named escapes, source-encoding declarations for non-ASCII characters in unicode literals, automatic encoding/decoding through the codecs module, unicode support in many libraries (like the regular expression and DB-API modules) and a builtin unicode database.

That said, you still need to know about encodings in order to handle text correctly. Your program will receive bytes in some encoding (be it from files, from environment variables or through other input) and they will need to be interpreted in that encoding. If you don't know the encoding (and can't determine it from the data, like in HTML or XML) you can really only process the data as bytes. If you do know the encoding, Python does allow you to deal with it mostly transparently.

like image 168
Thomas Wouters Avatar answered Nov 05 '22 02:11

Thomas Wouters


Perl has excellent support of unicode. You need to know how to use is properly, but i never find any language what has better unicode support than perl, especially now with perl5.14.

like image 45
jm666 Avatar answered Nov 05 '22 02:11

jm666


Racket (in the Lisp/Scheme camp) has good Unicode support. Racket distinguishes character strings (written "abc") from byte strings (written #"abc"). Character strings consist of Unicode characters and have all the Unicode-aware string operations one would expect (comparison, case folding, etc). By default Racket uses UTF-8 for character string I/O (including the encoding of source files), but it also supports conversion to and from other encodings. The GUI toolkit works with Unicode. So do regular expressions.

like image 27
Ryan Culpepper Avatar answered Nov 05 '22 03:11

Ryan Culpepper


From my personal experience, Ruby 1.9.2 handles unicode internally pretty good except some strange areas like upcase/downcase/capitalize methods for String class. I have to override them for all my Rails applications.

like image 30
Jake Jones Avatar answered Nov 05 '22 01:11

Jake Jones