Unescape unicode in character string

There is a long-standing bug in RJSONIO when parsing json strings containing unicode escape sequences. It seems the bug needs to be fixed in libjson, which might not happen any time soon, so I am looking into creating a workaround in R that unescapes \uxxxx sequences before feeding them to the json parser.

Some context: json data is always unicode, using utf-8 by default, so there is generally no need for escaping. But for historical reasons, json does support escaped unicode. Hence the json data

{"x" : "Zürich"}

and

{"x" : "Z\u00FCrich"}

are equivalent and should result in exactly the same output when parsed. But for whatever reason, the latter doesn't work in RJSONIO. Additional confusion is caused by the fact that R itself supports escaped unicode as well. So when we type "Z\u00FCrich" in an R console, it is automatically correctly converted to "Zürich". To get the actual json string at hand, we need to escape the backslash itself that is the first character of the unicode escape sequence in json:

test <- '{"x" : "Z\\u00FCrich"}'
cat(test)
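To make the distinction concrete, here is a small illustration (the variable names `parsed` and `literal` are only for this example) of how the two spellings differ once R has read them:

```r
# In R source code, "\u00FC" is resolved by the R parser to the single character ü
parsed <- "Z\u00FCrich"

# Doubling the backslash keeps the six literal characters \ u 0 0 F C in the string
literal <- "Z\\u00FCrich"

nchar(parsed)   # 6
nchar(literal)  # 11

# The escape sequence is still present as plain text in the second string
grepl("\\u00FC", literal, fixed = TRUE)  # TRUE
```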

So my question is: given a large json string in R, how can I unescape all escaped unicode sequences? I.e. how do I replace all occurrences of \uxxxx by the corresponding unicode character? Again, the \uxxxx here represents an actual string of 6 characters, starting with a backslash. So an unescape function should satisfy:

#Escaped string
escaped <- "Z\\u00FCrich"

#Unescape unicode
unescape(escaped) == "Zürich"

#This is the same thing
unescape(escaped) == "Z\u00FCrich"

One thing that might complicate things is that if the backslash itself is escaped in json with another backslash, it is not part of the unicode escape sequence. E.g. unescape should also satisfy:

#Watch out for escaped backslashes
unescape("Z\\\\u00FCrich") == "Z\\\\u00FCrich"
unescape("Z\\\\\\u00FCrich") == "Z\\\\ürich"
asked Jul 25 '14 09:07 by Jeroen Ooms

2 Answers

After playing with this some more, I think the best I can do is to search for \uxxxx patterns using a regular expression and then parse the matches using the R parser:

unescape_unicode <- function(x){
  #single string only
  stopifnot(is.character(x) && length(x) == 1)

  #find matches: one or more backslashes, then 'u' and four hex digits
  m <- gregexpr("(\\\\)+u[0-9a-f]{4}", x, ignore.case = TRUE)

  if(m[[1]][1] > -1){
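    #parse each match as an R string literal: the parser resolves \uxxxx and
    #halves escaped backslashes, so gsub doubles the remaining backslashes again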
    p <- vapply(regmatches(x, m)[[1]], function(txt){
      gsub("\\", "\\\\", parse(text=paste0('"', txt, '"'))[[1]], fixed = TRUE, useBytes = TRUE)
    }, character(1), USE.NAMES = FALSE)

    #substitute parsed into original
    regmatches(x, m) <- list(p)
  }

  x
}

This seems to work for all cases, and I haven't found any odd side effects yet.
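For comparison, the same idea can be sketched without calling the R parser, by counting the backslashes in front of each match and converting the hex digits directly. This is only a minimal sketch under stated assumptions: the name `unescape_unicode_base` is hypothetical, it handles a single string, and it does not combine surrogate pairs (two consecutive \uxxxx sequences encoding one character outside the Basic Multilingual Plane).

```r
# Hypothetical helper: unescape \uxxxx sequences using base R only
unescape_unicode_base <- function(x) {
  # single string only
  stopifnot(is.character(x), length(x) == 1)

  # one or more backslashes, followed by 'u' and exactly four hex digits
  m <- gregexpr("\\\\+u[0-9a-fA-F]{4}", x)
  if (m[[1]][1] == -1) return(x)

  hits <- regmatches(x, m)[[1]]
  repl <- vapply(hits, function(txt) {
    # each match is <n backslashes> + "u" + 4 hex digits
    n_bs <- nchar(txt) - 5L

    # even count: every backslash belongs to an escaped pair, so leave as-is
    if (n_bs %% 2L == 0L) return(txt)

    # odd count: the last backslash starts the unicode escape sequence
    code <- strtoi(substr(txt, nchar(txt) - 3L, nchar(txt)), base = 16L)
    paste0(strrep("\\", n_bs - 1L), intToUtf8(code))
  }, character(1), USE.NAMES = FALSE)

  # substitute the converted pieces back into the original string
  regmatches(x, m) <- list(repl)
  x
}

# The three conditions from the question
unescape_unicode_base("Z\\u00FCrich") == "Z\u00FCrich"
unescape_unicode_base("Z\\\\u00FCrich") == "Z\\\\u00FCrich"
unescape_unicode_base("Z\\\\\\u00FCrich") == "Z\\\\\u00FCrich"
```

Like the parser-based version, this leaves all other json escapes (such as \n or \") untouched, so the result can still be passed to the json parser.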

answered Oct 09 '22 06:10 by Jeroen Ooms


There is a function for this in the stringi package:

library(stringi)
escaped <- "Z\\u00FCrich"
escaped
## [1] "Z\\u00FCrich"
stri_unescape_unicode(escaped)
## [1] "Zürich"
answered Oct 09 '22 06:10 by bartektartanus