Problem description:
Mathematica uses "\:nnnn" as the syntax for Unicode input. For example, if we enter "\:6c34", we get "水" ("water" in Chinese). But what if one wants to enter "\:1f618" (a face throwing a kiss)? When I tried this, I got "ὡ8", not a face throwing a kiss. So Mathematica evaluates "\:1f61" before I have entered the "8".
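One way to confirm how the input was parsed is to look at the character codes of the resulting string. This is a minimal check using only the built-in ToCharacterCode:

```mathematica
(* "\:nnnn" consumes exactly four hex digits, so the trailing 8 is an ordinary character *)
ToCharacterCode["\:1f618"]
(* {8033, 56}: 8033 is 16^^1f61 ("ὡ") and 56 is the digit "8" *)
```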
Question: How can we delay this evaluation, or more generally, how can we enter a Unicode character whose hexadecimal code has more than 4 digits?
Software and hardware platform: I am running Mathematica 8 on an Intel Mac. I tried both the command-line version of Mathematica and the Mathematica notebook; they behave the same.
Thank you.
Reflections: Unicode is an extensible standard and it can grow (and it does grow :)). Software systems that implement this standard may implement only a subset of it (8-bit, 16-bit or 32-bit encoding) and still be valid and useful. As the user of a software package, one should not assume that because the software says it supports Unicode, it supports the entire Unicode character set.
Another input method, available on some systems: press Ctrl + ⇧ Shift + u, release, then type the hex digits and press ↵ Enter (or Space, or in some cases just press and release ⇧ Shift or Ctrl).
Unicode is a universal character set. It aims to include all the characters needed for any writing system or language. The first 65,536 code points, which fit in 16 bits, hold the most commonly used characters of a number of languages; this range is the Basic Multilingual Plane (BMP).
As of Unicode version 15.0, there are 149,186 characters with code points, covering 161 modern and historical scripts, as well as multiple symbol sets.
Non-BMP characters are represented in UTF-16 by an ordered pair (called a surrogate pair in Unicode vocabulary) of two 16-bit code units. Even though a non-BMP character reads as a single character, JavaScript's internal storage still treats it as two characters.
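The surrogate-pair arithmetic can be sketched in Mathematica as follows. The function name toSurrogatePair is made up for this illustration; the constants are the standard UTF-16 ones (high surrogates start at D800, low surrogates at DC00).

```mathematica
(* Hypothetical helper: split a non-BMP code point into its UTF-16 surrogate pair *)
toSurrogatePair[cp_Integer] /; 16^^10000 <= cp <= 16^^10ffff :=
  With[{v = cp - 16^^10000},
    (* upper 10 bits go into the high surrogate, lower 10 bits into the low surrogate *)
    {16^^d800 + Quotient[v, 2^10], 16^^dc00 + Mod[v, 2^10]}]

toSurrogatePair[16^^1f618]
(* {55357, 56856}, i.e. d83d de18 in hexadecimal *)
```

For code points below 16^^10000 the definition does not apply and the call is returned unevaluated, since such characters need no surrogate pair.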
Short answer: You can't do this because Mathematica doesn't support these characters properly. See the end of the post for some workarounds.
Just to clear up some things:
There's no need for a 32-bit encoding to handle more than ~65000 Unicode characters. The most common encodings used for Unicode, UTF-8 and UTF-16, are multibyte encodings, meaning that a variable number of bytes are used to represent characters. UTF-16 can use either 2 or 4 bytes to represent a character. The Mathematica kernel will interpret every 2-byte sequence as a single character in a string, resulting in some invalid characters on occasion (when encountering a 4-byte sequence). This may be considered a bug. The front end is quite moody about how it handles 4-byte sequences, which is definitely a bug.
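As a small illustration of the variable-length point, the kernel can show the UTF-8 bytes of a string through the second argument of ToCharacterCode. This sketch only uses BMP characters, which Mathematica handles without trouble:

```mathematica
(* An ASCII character takes one byte in UTF-8 *)
ToCharacterCode["a", "UTF8"]
(* {97} *)

(* U+6C34, the character from the question, takes three bytes *)
ToCharacterCode["\:6c34", "UTF8"]
(* {230, 176, 180}, i.e. e6 b0 b4 in hexadecimal *)
```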
Limited workaround
When working strictly in the kernel (e.g. reading the Unicode data from a file), I sometimes use this function as a workaround to get the actual Unicode code point of 2-unit (4-byte) UTF-16 sequences:
toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff := (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4
You can use
Split[ToCharacterCode[str], If[16^^d800 <= # <= 16^^dbff, True] &]
to split a UTF-16 string into Unicode characters correctly (either length-one or length-two, depending on the character).
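Putting the two pieces together might look like the sketch below. The one-element rule for toCodePoint and the helper name codePoints are my additions for illustration, not part of the original workaround, and the input is given as a plain list of UTF-16 code units so that the example does not depend on how a particular version builds strings.

```mathematica
(* Definition from the workaround above *)
toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff :=
  (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4

(* Assumed fallback, not in the original: a lone code unit is its own code point *)
toCodePoint[{a_}] := a

(* Hypothetical helper: list of UTF-16 code units -> list of Unicode code points *)
codePoints[units_List] :=
  toCodePoint /@ Split[units, If[16^^d800 <= # <= 16^^dbff, True] &]

(* "a", the surrogate pair d83d de18, "b" *)
codePoints[{97, 16^^d83d, 16^^de18, 98}]
(* {97, 128536, 98}, where 128536 is 16^^1f618 *)
```

For a real string, ToCharacterCode[str] would supply the list of code units. Malformed input, such as a high surrogate that is not followed by a low surrogate, is not handled here and leaves toCodePoint unevaluated for that pair.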
This is an ugly and inconvenient workaround, and it won't allow you to display any of these characters in the front end unless you come up with some hack for that as well, e.g. importing the glyph reference images from unicode.org (at least for CJK they have them).
See also
See my earlier question on the same topic: Reading an UTF-8 encoded text file in Mathematica
If you are going to work with Chinese, you may come across this other problem too: Getting the Mathematica front end to obey the FontFamily option
According to this page in the Mathematica 8 help:
Mathematica supports both 8- and 16-bit raw character encodings.
Presumably they are saying that they don't support 32-bit encodings as would be needed to support your desired character.
As further evidence (in the absence of a clear statement in the documentation), the list of supported encodings on the same page has no 32-bit encodings. 32-bit encodings are apparently only supported in MathLink. I suppose there hasn't been enough user demand.