
How to enter non-BMP Unicode (code points with more than 4 hexadecimal digits) as input to Mathematica

Problem description: Mathematica uses "\:nnnn" as the syntax for Unicode input. For example, if we enter "\:6c34", we get "水" ("water" in Chinese). But what if one wants to enter "\:1f618" (a face throwing a kiss)? When I tried this, I got "ὡ8", not a face throwing a kiss. So Mathematica evaluates "\:1f61" (which is "ὡ", U+1F61) before I have entered the "8".

Question: How can we delay this evaluation, or more generally, how can we enter any Unicode character whose code point has more than 4 hexadecimal digits?

Software and hardware platform: I am running Mathematica 8 on an Intel Mac. I tried both the command-line version of Mathematica and the Mathematica notebook; they behave the same.

Thank you.


Reflections: Unicode is an extensible standard and it can grow (and it does grow :)). Software systems that implement this standard may implement only a subset of it (8-bit, 16-bit or 32-bit encoding) and still be valid and useful. As the user of a software package, one should not assume that because the software says it supports Unicode, it supports the whole of Unicode.

Ning asked Nov 09 '11 00:11




2 Answers

Short answer: You can't do this, because Mathematica doesn't support these characters properly. See the end of this post for some workarounds.

Just to clear up some things:

There's no need for a 32-bit encoding to handle more than ~65,000 Unicode characters. The most common encodings used for Unicode, UTF-8 and UTF-16, are multibyte (variable-width) encodings, meaning that a variable number of bytes is used to represent each character. UTF-16 uses either 2 or 4 bytes to represent a character. The Mathematica kernel interprets every 2-byte unit as a single character in a string, so a 4-byte sequence ends up as a pair of invalid characters. This may be considered a bug. The front end is quite moody about how it handles 4-byte sequences, which is definitely a bug.
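To make the 2-versus-4-byte point concrete, here is a minimal sketch of how UTF-16 splits a code point above U+FFFF into two 16-bit units, which is what the kernel then sees as two separate characters. The name toSurrogatePair is hypothetical (it is not a built-in), and the arithmetic is the standard UTF-16 surrogate calculation rather than anything specific to Mathematica:

```mathematica
(* Hypothetical helper, not a built-in: split a non-BMP code point
   (U+10000 .. U+10FFFF) into its UTF-16 high and low surrogate units *)
toSurrogatePair[cp_Integer] /; 16^^10000 <= cp <= 16^^10ffff :=
  With[{v = cp - 16^^10000},
    {16^^d800 + Quotient[v, 2^10], 16^^dc00 + Mod[v, 2^10]}]

(* U+1F618, the character from the question *)
toSurrogatePair[16^^1f618]
(* {55357, 56856} *)

(* The same two units in hexadecimal *)
IntegerString[#, 16] & /@ toSurrogatePair[16^^1f618]
(* {"d83d", "de18"} *)
```

So U+1F618 occupies two 16-bit units, d83d and de18, and neither unit is a valid character on its own.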

Limited workaround

When working strictly in the kernel (e.g. reading the Unicode data from a file), I sometimes use this function as a workaround to get the actual Unicode code point of 2-unit (4-byte) UTF-16 sequences:

toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff := (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4

You can use

Split[ToCharacterCode[str], (16^^d800 <= #1 <= 16^^dbff) &]

to split a UTF-16 string into Unicode characters correctly (either length-one or length-two, depending on the character).
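As a worked example, the sketch below combines the two pieces on a hand-written list of UTF-16 code units. It assumes the units are already available as integers (as ToCharacterCode would return them), so it does not depend on how the string was entered. The one-element rule for toCodePoint is my own hypothetical addition so that ordinary BMP characters pass through unchanged, and the Split test is an equivalent form of the one above:

```mathematica
(* Definition repeated from above so this block runs on its own *)
toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff :=
  (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4

(* Hypothetical addition: a lone BMP code unit is its own code point *)
toCodePoint[{a_}] := a

(* UTF-16 units for U+6C34 followed by the surrogate pair of U+1F618 *)
units = {16^^6c34, 16^^d83d, 16^^de18};

(* Keep a high surrogate together with the unit that follows it *)
groups = Split[units, (16^^d800 <= #1 <= 16^^dbff) &]
(* {{27700}, {55357, 56856}} *)

toCodePoint /@ groups
(* {27700, 128536} *)
```

Here 128536 is 16^^1f618, the code point the question was trying to enter.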

This is an ugly and inconvenient workaround, and it won't allow you to display any of these characters in the front end unless you come up with some hack for that as well, e.g. importing the glyph reference images from unicode.org (at least for CJK they have them).

See also

See my earlier question on the same topic: Reading an UTF-8 encoded text file in Mathematica

If you are going to work with Chinese, you may come across this other problem too: Getting the Mathematica front end to obey the FontFamily option

Szabolcs answered Oct 14 '22 09:10


According to this page in the Mathematica 8 help:

Mathematica supports both 8- and 16-bit raw character encodings.

Presumably they are saying that they don't support 32-bit encodings as would be needed to support your desired character.

As further evidence (in the absence of a clear statement in the documentation), the list of supported encodings on the same page has no 32-bit encodings. 32-bit encodings are apparently only supported in MathLink. I suppose there hasn't been enough user demand.

Codie CodeMonkey answered Oct 14 '22 11:10