I'm having trouble handling escaped unicode characters in R, specifically those encountered when grabbing information from the MediaWiki API. I would find a JSON string like
{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}
This should be perfectly valid, but when I read it in through fromJSON() I get:
snip...
[1] "Banach\023Tarski paradox"
Initially I thought this was just a problem with RJSONIO, but I encounter similar problems with scan() and readLines(). My guess is that I am missing something very basic.
I can't actually give a completely reproducible example using only R, because if I send "em\u2013dash" to a file through write() (or some equivalent function), R will automatically convert the \u2013 escape into the dash character itself. So here goes. Create a text file named test1 with the following:
"em\u2013dash" "em–dash" " em \u2013 dash"
Then load up R (for whatever the file path is):
> scan( file = "~/R/test1", what = "character", encoding = "UTF-8")
Read 3 items
[1] "em\\u2013dash" "em–dash" " em \\u2013 dash"
> readLines("~/R/test1", warn = FALSE, encoding = "UTF-8")
[1] "\"em\\u2013dash\" \"em–dash\" \" em \\u2013 dash\""
The added escape character is what causes my problems with fromJSON(). I could just strip them out, but I'd probably break something else in the process, and I imagine there is an easier solution. Thanks.
Here's the session info:
R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C/en_US.UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RJSONIO_0.98-0
loaded via a namespace (and not attached):
[1] tools_2.14.1
A Unicode escape sequence is a backslash followed by the letter 'u' followed by four hexadecimal digits (0-9a-fA-F). It matches a character in the target sequence with the value specified by the four digits. For example, "\u0041" matches the target sequence "A" when the ASCII character encoding is used.
The em dash is encoded in Unicode as U+2014 (decimal 8212) and represented in HTML by the named character entity &mdash;.
"End of guarded area" (U+0097) encoded in UTF-8 is the two-byte sequence 0xC2 0x97. The text file was correctly interpreted as Windows-1252, so the 0x97 is recognized as an em dash, which was correctly encoded in UTF-8 as 0xE2 0x80 0x94.
According to section 3.3 of the Java Language Specification (JLS) a unicode escape consists of a backslash character (\) followed by one or more 'u' characters and four hexadecimal digits.
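The code points and byte sequences quoted above can be checked from R itself. This is only a small sketch using base R's intToUtf8(), utf8ToInt() and charToRaw(); the variable names are made up for illustration.

```r
# Build each character from its code point, then look at its UTF-8 bytes.
em_dash <- intToUtf8(0x2014L)   # U+2014, em dash
en_dash <- intToUtf8(0x2013L)   # U+2013, the character written as \u2013 in the question
epa     <- intToUtf8(0x97L)     # U+0097, "end of guarded area"

charToRaw(em_dash)   # e2 80 94
charToRaw(en_dash)   # e2 80 93
charToRaw(epa)       # c2 97

# A parsed \u escape is just the character with that code point.
utf8ToInt("\u0041")  # 65, i.e. "A"
```

Note that \u2013 is the en dash; the em dash is the next code point, \u2014.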
This is not in fact a bug in RJSONIO. It is designed to expect a string that has been read by R and in which the non-ASCII characters have already been processed. When one passes it a string containing \\u, the sequence has not been processed, only escaped. On my machine, with the locale set to en_US.UTF-8, the command
fromJSON('{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}')
produces
$query
$query$categorymembers
$query$categorymembers[[1]]
$query$categorymembers[[1]]$ns
[1] 0
$query$categorymembers[[1]]$title
[1] "Banach–Tarski paradox"
Note that the character is prefixed by \u, not \\u.
See how it appears in R when you simply enter that string.
So the real question lies upstream of fromJSON(): why does the string contain an unprocessed \u escape in the first place?
I may add support in RJSONIO for handling such unprocessed strings.
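Until a parser handles such unprocessed strings itself, one possible workaround is to convert the literal escapes before parsing. The following is only a sketch in base R: unescape_unicode is a hypothetical helper, not part of RJSONIO or rjson. It assumes the input is JSON text in which non-ASCII characters appear as a backslash, a 'u', and four hex digits. ASCII escapes (such as \u0022) and surrogate halves are deliberately left alone so the JSON stays valid.

```r
# Hypothetical helper: replace literal \uXXXX escapes with the characters they name.
unescape_unicode <- function(x) {
  # Regex: one literal backslash, 'u', then exactly four hex digits.
  m <- gregexpr("\\\\u[0-9a-fA-F]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    codes <- strtoi(substring(esc, 3), base = 16L)
    # Skip ASCII (could be a quote or backslash) and UTF-16 surrogate halves.
    convert <- codes > 0x7F & (codes < 0xD800 | codes > 0xDFFF)
    esc[convert] <- vapply(codes[convert], intToUtf8, character(1))
    esc
  })
  x
}

json <- '{"x":"foo\\u2013bar"}'   # contains a literal backslash, as scan() would return it
fixed <- unescape_unicode(json)
nchar(json) - nchar(fixed)        # 5: six characters collapsed into one
```

The sketch does not understand JSON itself, so an escaped backslash that happens to be followed by u and four hex digits would be converted too. Characters outside the Basic Multilingual Plane, which JSON writes as surrogate pairs, are left as escapes.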
It is a bug in RJSONIO, as you can clearly see:
> RJSONIO::fromJSON('{"x":"foo\\u2013bar"}')
x
"foo\023bar"
It works just fine in rjson:
> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo–bar"
and to prove it is the correct value:
> Sys.setlocale("LC_ALL", "C")
[1] "C/C/C/C/C/en_US.UTF-8"
> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo<U+2013>bar"
In your analysis you got confused by printed strings vs. actual strings. print quotes and escapes its content for printing; if you want to see the actual string, you can use cat or charToRaw. Also, scan doesn't interpret any escapes, so you get what you give it.
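To make the printed-vs-actual distinction concrete, here is a small sketch using the first string from the question's test file. The variable name x is just for illustration.

```r
# What scan() returned: a literal backslash followed by u2013, not a dash.
x <- "em\\u2013dash"

print(x)          # [1] "em\\u2013dash"   print doubles the backslash for display
cat(x, "\n")      # em\u2013dash          the characters actually stored
nchar(x)          # 12: "em" + 6 escape characters + "dash"
charToRaw(x)[3]   # 5c: a single backslash byte
```

So the string holds only one backslash; the second one exists only in print's output.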