How to correctly deal with escaped Unicode Characters in R e.g. the em dash (—)

Tags: r, unicode

I'm having trouble handling escaped Unicode characters in R, specifically those encountered when grabbing information from the MediaWiki API. I get a JSON string like

{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}

This should be perfectly valid, but when I read it in through fromJSON() I get:

snip...
[1] "Banach\023Tarski paradox"

Initially I thought this was just a problem with RJSONIO, but I encounter similar problems with scan() and readLines(). My guess is that I am missing something very basic.

I can't actually give a completely reproducible example using only R because if I send "em\u2013dash" to a file through write() (or some equivalent function) R will automatically convert the em dash. So here goes. Create a text file named test1 with the following:

"em\u2013dash" "em–dash" " em \u2013 dash"

Then load up R (for whatever the file path is):

> scan( file = "~/R/test1", what = "character", encoding = "UTF-8")
Read 3 items
[1] "em\\u2013dash"    "em–dash"          " em \\u2013 dash"
> readLines("~/R/test1", warn = FALSE, encoding = "UTF-8")
[1] "\"em\\u2013dash\" \"em–dash\" \" em \\u2013 dash\""

The added escape character is what causes my problems with fromJSON(). I could just strip them out but I'd probably break something else in the process and I imagine there is an easier solution. Thanks.

Here's the session info:

R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RJSONIO_0.98-0

loaded via a namespace (and not attached):
[1] tools_2.14.1

asked Feb 10 '12 06:02

Adam Hyland

2 Answers

This is not in fact a bug in RJSONIO. It is designed to expect a string that has been read by R and in which the non-ASCII characters have already been processed. A string that still contains a literal \u sequence has not been processed, only escaped. On my machine with the locale set to en_US.UTF-8, the command

fromJSON('{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}')

produces

$query
$query$categorymembers
$query$categorymembers[[1]]
$query$categorymembers[[1]]$ns
[1] 0

$query$categorymembers[[1]]$title
[1] "Banach–Tarski paradox"

Note that the character is prefixed by \u not \\u. See how it appears in R when you simply enter that string.
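To make the distinction concrete, here is a small base R sketch (the variable names are only illustrative). It compares the string R builds when its parser sees \u2013 with the string that holds a literal backslash:

```r
# Typed in R source, "\u2013" is turned into one character by R's parser.
parsed <- "em\u2013dash"
# With a doubled backslash the string holds a literal backslash, "u" and four digits.
literal <- "em\\u2013dash"

nchar(parsed)    # 7
nchar(literal)   # 12

# The third character of the parsed string is code point U+2013 (decimal 8211).
utf8ToInt(substr(parsed, 3, 3))

# The third character of the literal string is just a backslash.
substr(literal, 3, 3) == "\\"
```

The second form is what ends up in R when the text comes from a file or an HTTP response, because nothing has interpreted the escape yet.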

So the issue is upstream of fromJSON(): why does the string still contain a literal \u? I may add support in RJSONIO for handling such unprocessed strings.
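In the meantime, one possible workaround is to convert the literal escapes before the string reaches the parser. The following is only a sketch in base R: unescape_unicode is a hypothetical helper, not part of RJSONIO, and it assumes every \uXXXX sequence stands for a single character in the Basic Multilingual Plane.

```r
# Hypothetical helper: replace literal \uXXXX sequences with the characters
# they name. It does not handle surrogate pairs or escaped backslashes.
unescape_unicode <- function(x) {
  pattern <- "\\\\u[0-9A-Fa-f]{4}"   # a backslash, "u", then four hex digits
  hits <- gregexpr(pattern, x)
  regmatches(x, hits) <- lapply(regmatches(x, hits), function(esc) {
    # Drop the leading \u and read the remaining digits as a hex code point.
    codes <- strtoi(substring(esc, 3), base = 16L)
    intToUtf8(codes, multiple = TRUE)
  })
  x
}

raw_json <- '{"x":"foo\\u2013bar"}'   # as it would arrive from a file or the web
fixed <- unescape_unicode(raw_json)
identical(fixed, '{"x":"foo\u2013bar"}')
```

Because the substitution happens before parsing, an escape such as \u0022 (a double quote) would yield invalid JSON, so a parser that decodes the escapes itself is the safer route when one is available.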

answered Oct 06 '22 01:10

Duncan Temple Lang

It is a bug in RJSONIO as you can clearly see:

> RJSONIO::fromJSON('{"x":"foo\\u2013bar"}')
           x 
"foo\023bar"

It works just fine in rjson:

> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo–bar"

and to prove it is the correct value:

> Sys.setlocale("LC_ALL", "C")
[1] "C/C/C/C/C/en_US.UTF-8"
> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo<U+2013>bar"

In your analysis you were confused by printed strings vs actual strings. print quotes and escapes its content for display; if you want to see the actual string, use cat or charToRaw. Also, scan doesn't interpret any escapes, so you get back exactly what you give it.
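The file test can in fact be reproduced from R alone. A sketch, assuming a temporary file is an acceptable stand-in for ~/R/test1:

```r
path <- tempfile()
# In R source, "\\u2013" is a literal backslash followed by u2013, so the file
# receives the six characters \u2013 rather than a dash.
writeLines('"em\\u2013dash"', path)

x <- scan(file = path, what = "character", encoding = "UTF-8", quiet = TRUE)

print(x)          # [1] "em\\u2013dash"  (backslash doubled for display only)
cat(x, "\n")      # em\u2013dash        (the string as stored)
nchar(x)          # 12
charToRaw(x)[3]   # 5c, a single backslash byte
unlink(path)
```

The doubled backslash appears only in the print output; the stored string has one backslash, which is exactly what the file contains.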

answered Oct 06 '22 00:10

Simon Urbanek
