
Using Emoji literals in Clojure source

On Linux with UTF-8 enabled console:

Clojure 1.6.0
user=> (def c \の)
#'user/c
user=> (str c)
"の"
user=> (def c \🍒)

RuntimeException Unsupported character: \🍒  clojure.lang.Util.runtimeException (Util.java:221)
RuntimeException Unmatched delimiter: )  clojure.lang.Util.runtimeException (Util.java:221)

I was hoping to have an emoji-rich Clojure application with little effort, but it appears I will be looking up and typing in emoji codes? Or am I missing something obvious here? 😞

noahlz asked May 28 '14 04:05



2 Answers

Java represents Unicode characters in UTF-16. The emoji characters are "supplementary characters": their code points lie above U+FFFF and so cannot be represented in a single 16-bit char.

http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html

In essence, a supplementary character is represented not as a single char but as an int code point (inside a String it occupies a surrogate pair of two chars), and there are special APIs for dealing with them.
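To make the char/int distinction concrete, here is a minimal sketch using only standard java.lang.String and java.lang.Character methods. The helper names `code-units` and `code-points` are made up for this illustration.

```clojure
;; String literals may contain any code point, even though char literals may not.
(def cherries "🍒")

(defn code-units
  "Number of UTF-16 chars in s."
  [^String s]
  (.length s))

(defn code-points
  "Number of Unicode code points in s."
  [^String s]
  (.codePointCount s 0 (.length s)))

(code-units cherries)                                  ; => 2 (a surrogate pair)
(code-points cherries)                                 ; => 1
(.codePointAt ^String cherries 0)                      ; => 127826
(Character/isHighSurrogate (.charAt ^String cherries 0)) ; => true
```

This is why the reader rejects the char literal: a Clojure char is one UTF-16 code unit, and this character needs two.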

One way is with (Character/toChars 128516) - this returns a char array that you can convert to a string to print: (apply str (Character/toChars 128516)). Or you can create a String from an array of codepoint ints directly with (String. (int-array [128516]) 0 1). Depending on all the various things between Java/Clojure and your eyeballs, that may or may not do what you want.

The format API supports supplementary characters, so that may be easiest. However, %c takes an int and a Clojure integer literal is a long, so you'll need a cast: (format "Smile! %c" (int 128516)).
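The calls above can be wrapped in a small helper. This is only a sketch; `codepoint->str` is a name invented here, not part of Clojure or Java.

```clojure
(defn codepoint->str
  "Returns a String holding the single Unicode code point cp."
  [cp]
  ;; Character/toChars yields one char for BMP code points and a
  ;; surrogate pair (two chars) for supplementary ones.
  (String. (Character/toChars (int cp))))

(def smile (codepoint->str 128516))   ; U+1F604, the code point used above

(.length ^String smile)               ; => 2
(.codePointAt ^String smile 0)        ; => 128516
(= smile (apply str (Character/toChars 128516)))       ; => true
(= smile (String. (int-array [128516]) 0 1))           ; => true
(format "Smile! %c" (int 128516))
```

Whether the result displays as an emoji still depends on the terminal and font, as noted above.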

Alex Miller answered Sep 30 '22 17:09



Thanks to Clojure’s extensible reader tags, you can create Unicode literals quite easily yourself.

We already know that not all of Unicode can be represented as char literals; that the preferred representation of Unicode characters on the JVM is int; and that a string literal can hold any Unicode character in a way that’s also convenient for humans to read.

So, a tagged literal #u "🍒" that reads as an int would make an excellent Unicode character literal!

Set up a reader function for the new tagged literal in *data-readers*:

(defn read-codepoint
  [^String s]
  {:pre [(= 1 (.codePointCount s 0 (.length s)))]}
  (.codePointAt s 0))

(set! *data-readers* (assoc *data-readers* 'u #'read-codepoint))

With that in place, the reader reads such literals as code point integers:

#u"🍒"  ; => 127826
(Character/getName #u"🍒")  ; => "CHERRIES"

‘Reader tags without namespace qualifiers are reserved for Clojure’, says the documentation … #u is short but perhaps not the most responsible choice.
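A namespaced tag avoids the reserved-name problem. The sketch below assumes a hypothetical tag `emoji/cp` (any namespace you control would do) and binds *data-readers* only around a read-string call, so it can be tried without touching the global setup.

```clojure
(defn read-codepoint
  "Reads a one-code-point string as its int code point."
  [^String s]
  {:pre [(= 1 (.codePointCount s 0 (.length s)))]}
  (.codePointAt s 0))

;; emoji/cp is a made-up tag name for this illustration.
(def cherries-cp
  (binding [*data-readers* (assoc *data-readers* 'emoji/cp #'read-codepoint)]
    (read-string "#emoji/cp \"🍒\"")))

cherries-cp                       ; => 127826
(Character/getName cherries-cp)   ; => "CHERRIES"
```

To use such a tag directly in source files, the same mapping can instead be declared in a data_readers.clj file at the root of the classpath, with the tag symbol mapped to the fully qualified name of the reader function.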

glts answered Sep 30 '22 16:09
