How to correctly deal with escaped Unicode Characters in R e.g. the em dash (—)

Tags: r, unicode
I'm having trouble handling escaped Unicode characters in R, specifically those encountered when grabbing information from the MediaWiki API. I get a JSON string like

{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}

This should be perfectly valid, but when I read it in through fromJSON() I get:

snip...
[1] "Banach\023Tarski paradox"

Initially I thought this was just a problem with RJSONIO, but I encounter similar problems with scan() and readLines(). My guess is that I am missing something very basic.

I can't actually give a completely reproducible example using only R, because if I send "em\u2013dash" to a file through write() (or some equivalent function), R will automatically convert the escape into the dash itself (strictly speaking, U+2013 is the en dash, not the em dash). So here goes. Create a text file named test1 with the following:

"em\u2013dash" "em–dash" " em \u2013 dash"

Then load up R (for whatever the file path is):

> scan( file = "~/R/test1", what = "character", encoding = "UTF-8")
Read 3 items
[1] "em\\u2013dash"    "em–dash"          " em \\u2013 dash"
> readLines("~/R/test1", warn = FALSE, encoding = "UTF-8")
[1] "\"em\\u2013dash\" \"em–dash\" \" em \\u2013 dash\""

The added escape character is what causes my problems with fromJSON(). I could just strip them out but I'd probably break something else in the process and I imagine there is an easier solution. Thanks.

Here's the session info:

R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RJSONIO_0.98-0

loaded via a namespace (and not attached):
[1] tools_2.14.1

Adam Hyland asked Feb 10 '12 06:02


2 Answers

This is not in fact a bug in RJSONIO. It is designed to expect a string that has already been read by R, with the non-ASCII characters already processed. A string that still contains a literal \u sequence has not been processed, only escaped. On my machine, with the locale set to en_US.UTF-8, the command

fromJSON('{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}')

produces

$query
$query$categorymembers
$query$categorymembers[[1]]
$query$categorymembers[[1]]$ns
[1] 0

$query$categorymembers[[1]]$title
[1] "Banach–Tarski paradox"

Note that the character is prefixed by \u not \\u. See how it appears in R when you simply enter that string.
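
To make the \u versus \\u distinction concrete, here is a minimal base R sketch. The variable names parsed and literal are only illustrative. It assumes the string is typed in R source, where R's parser expands \u2013 itself; text read from a file or a network response does not go through that parser.

```r
# Parsed by R: the \u2013 escape becomes one character, U+2013 (EN DASH).
parsed <- "em\u2013dash"

# Doubled backslash: a literal backslash followed by the text "u2013".
literal <- "em\\u2013dash"

nchar(parsed)                               # 7
nchar(literal)                              # 12
utf8ToInt(substr(parsed, 3, 3)) == 0x2013   # TRUE
substr(literal, 3, 3) == "\\"               # TRUE
```

The second form is what a JSON document carries on the wire, so it is the JSON parser, not R's own parser, that has to expand it.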

So the issue is upstream of fromJSON() as to why the string contains \u.
I may add support in RJSONIO for handling such unprocessed strings.
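
As a rough illustration of what handling such unprocessed strings could look like in base R, the sketch below replaces each literal \uXXXX sequence with the character it names. The helper name unescape_unicode is hypothetical and is not part of RJSONIO or rjson. It assumes four hex digits per escape and does not handle surrogate pairs or an escaped backslash that happens to precede "u". Running it over raw JSON text before parsing could also break the document (for example if \u0022 expands to a quote), so a JSON parser that decodes these escapes itself is the safer route.

```r
# Hypothetical helper: expand literal \uXXXX sequences into UTF-8 characters.
unescape_unicode <- function(x) {
  # Regex: a literal backslash, "u", then exactly four hex digits.
  pattern <- "\\\\u([0-9a-fA-F]{4})"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    # Drop the leading backslash and "u", parse the hex digits as a code point.
    codes <- strtoi(substring(hits, 3), base = 16L)
    vapply(codes, intToUtf8, character(1))
  })
  x
}

unescape_unicode("Banach\\u2013Tarski paradox") == "Banach\u2013Tarski paradox"  # TRUE
```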

Duncan Temple Lang answered Oct 06 '22 01:10


It is a bug in RJSONIO as you can clearly see:

> RJSONIO::fromJSON('{"x":"foo\\u2013bar"}')
           x 
"foo\023bar" 
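
A quick way to see what that mangled output contains: in R, "\023" is an octal escape for a single control character. The sketch below only inspects the bytes. The observation that the value equals the low byte of U+2013 is consistent with the code point being truncated, but that is an inference from the output, not a statement about RJSONIO's internals.

```r
# "\023" is an octal escape: one character with value 0x13 (decimal 19).
mangled <- "\023"
as.integer(charToRaw(mangled))   # 19

# The low byte of the intended code point U+2013 is also 0x13.
bitwAnd(0x2013L, 0xFFL)          # 19
```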

It works just fine in rjson:

> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo–bar"

and to prove it is the correct value:

> Sys.setlocale("LC_ALL", "C")
[1] "C/C/C/C/C/en_US.UTF-8"
> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo<U+2013>bar"

In your analysis you got confused by printed strings vs. actual strings. print() quotes and escapes its content for display; if you want to see the actual string, you can use cat() or charToRaw(). Also, scan() doesn't interpret any escapes, so you get what you give it.
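
The printed-versus-actual distinction can be checked with a few base R calls. This is a small sketch using a temporary file in place of the ~/R/test1 file from the question; the names s, tmp and from_file are illustrative.

```r
s <- "em\\u2013dash"    # same value scan() returned for the first item

print(s)          # [1] "em\\u2013dash"  (print escapes the backslash)
cat(s, "\n")      # em\u2013dash         (the actual characters)
nchar(s)          # 12: there is only one backslash in the string
charToRaw(s)[3]   # 5c, a single backslash byte

# Round trip through a file: the file holds "em\u2013dash" with one backslash,
# and scan() hands it back verbatim without expanding the \u escape.
tmp <- tempfile()
writeLines('"em\\u2013dash"', tmp)
from_file <- scan(tmp, what = "character", quiet = TRUE)
identical(from_file, s)   # TRUE
```

So the doubled backslash in the question's output is only how print displays the string; the data handed to fromJSON() contains a single backslash followed by u2013, which is valid JSON.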

Simon Urbanek answered Oct 06 '22 00:10