Unescape unicode in character string

There is a long-standing bug in RJSONIO when parsing JSON strings that contain unicode escape sequences. It seems the bug needs to be fixed in libjson, which might not happen any time soon, so I am looking into creating a workaround in R that unescapes \uxxxx sequences before feeding them to the JSON parser.

Some context: JSON data is always unicode, using UTF-8 by default, so there is generally no need for escaping. But for historical reasons, JSON does support escaped unicode. Hence the JSON data

{"x" : "Zürich"}

and

{"x" : "Z\u00FCrich"}

are equivalent and should result in exactly the same output when parsed. But for whatever reason, the latter doesn't work in RJSONIO. Additional confusion is caused by the fact that R itself supports escaped unicode as well: when we type "Z\u00FCrich" in an R console, it is automatically converted to "Zürich". To get the actual JSON string at hand, we need to escape the backslash that starts the unicode escape sequence:

test <- '{"x" : "Z\\u00FCrich"}'
cat(test)
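
To make the difference concrete, here is a small sketch in base R that compares the two spellings. The variable names `parsed` and `literal` are only illustrative.

```r
# R-level escape: the R parser turns \u00FC into the single character "ü"
parsed <- "Z\u00FCrich"

# Escaped backslash: the string keeps the six characters \, u, 0, 0, F, C
literal <- "Z\\u00FCrich"

nchar(parsed)               # 6
nchar(literal)              # 11
identical(parsed, literal)  # FALSE
```

Only the second form contains the escape sequence as it would appear inside raw JSON text.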

So my question is: given a large JSON string in R, how can I unescape all escaped unicode sequences, i.e. replace every occurrence of \uxxxx with the corresponding unicode character? Again, \uxxxx here represents an actual string of 6 characters, starting with a backslash. So an unescape function should satisfy:

#Escaped string
escaped <- "Z\\u00FCrich"

#Unescape unicode
unescape(escaped) == "Zürich"

#This is the same thing
unescape(escaped) == "Z\u00FCrich"

One thing that might complicate matters: if the backslash itself is escaped in JSON with another backslash, it is not part of a unicode escape sequence. E.g. unescape should also satisfy:

#Watch out for escaped backslashes
unescape("Z\\\\u00FCrich") == "Z\\\\u00FCrich"
unescape("Z\\\\\\u00FCrich") == "Z\\\\ürich"
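
Put differently, a run of backslashes in front of uxxxx only introduces a unicode escape when the run has odd length. A minimal sketch of that parity check, assuming a hypothetical helper named `starts_unicode_escape` that looks only at the first such run in the string:

```r
# Hypothetical helper (not part of any package): TRUE when the first run of
# backslashes that precedes "u" + four hex digits has odd length
starts_unicode_escape <- function(x) {
  # Lookahead keeps only the backslash run itself in the match
  run <- regmatches(x, regexpr("\\\\+(?=u[0-9a-fA-F]{4})", x, perl = TRUE))
  length(run) == 1 && nchar(run) %% 2 == 1
}

starts_unicode_escape("Z\\u00FCrich")      # one backslash: a unicode escape
starts_unicode_escape("Z\\\\u00FCrich")    # two: an escaped backslash, then plain text
starts_unicode_escape("Z\\\\\\u00FCrich")  # three: an escaped backslash, then a unicode escape
```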


asked Jul 25 '14 09:07

Jeroen Ooms

2 Answers

After playing with this some more, I think the best I can do is to search for \uxxxx patterns using a regular expression, and then parse the matches using the R parser:

unescape_unicode <- function(x){
  #single string only
  stopifnot(is.character(x) && length(x) == 1)

  #find matches: one or more backslashes, "u", then four hex digits
  m <- gregexpr("(\\\\)+u[0-9a-fA-F]{4}", x)

  if(m[[1]][1] > -1){
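    #parse each match with the R parser, which resolves \uxxxx but also
    #collapses each escaped backslash pair, so double the backslashes again
    p <- vapply(regmatches(x, m)[[1]], function(txt){
      gsub("\\", "\\\\", parse(text=paste0('"', txt, '"'))[[1]], fixed = TRUE, useBytes = TRUE)
    }, character(1), USE.NAMES = FALSE)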

    #substitute parsed into original
    regmatches(x, m) <- list(p)
  }

  x
}

This seems to work for all cases and I haven't found any odd side effects yet.
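
For comparison, the same idea can be sketched without calling parse(), by converting the four hex digits directly. This is only a sketch under stated assumptions: the name `unescape_unicode_strtoi` is made up for illustration, \u0000 and lone surrogate halves (0xD800 to 0xDFFF) are deliberately left untouched, and surrogate pairs are not combined into a single character.

```r
# Hypothetical alternative to the parse()-based function above
unescape_unicode_strtoi <- function(x) {
  # single string only
  stopifnot(is.character(x), length(x) == 1)

  # a run of backslashes followed by "u" and exactly four hex digits
  m <- gregexpr("\\\\+u[0-9a-fA-F]{4}", x)
  hits <- regmatches(x, m)[[1]]
  if (length(hits) == 0) return(x)

  replacement <- vapply(hits, function(txt) {
    n_slash <- nchar(txt) - 5L            # "u" plus four hex digits = 5 characters
    if (n_slash %% 2L == 0L) return(txt)  # even run: only escaped backslashes
    code <- strtoi(substring(txt, nchar(txt) - 3L), base = 16L)
    # assumption of this sketch: leave NUL and surrogate halves as they are
    if (code == 0L || (code >= 0xD800L && code <= 0xDFFFL)) return(txt)
    # keep the escaped backslashes in front, replace the final \uxxxx
    paste0(strrep("\\", n_slash - 1L), intToUtf8(code))
  }, character(1), USE.NAMES = FALSE)

  # substitute the converted pieces back into the original string
  regmatches(x, m) <- list(replacement)
  x
}

# The checks from the question
unescape_unicode_strtoi("Z\\u00FCrich") == "Z\u00FCrich"
unescape_unicode_strtoi("Z\\\\u00FCrich") == "Z\\\\u00FCrich"
unescape_unicode_strtoi("Z\\\\\\u00FCrich") == "Z\\\\\u00FCrich"
```

Avoiding parse() means no text from the input is ever evaluated as R syntax, at the cost of handling the backslash parity by hand.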


answered Oct 09 '22 06:10

Jeroen Ooms

There is a function for this in the stringi package :)

library(stringi)
escaped <- "Z\\u00FCrich"
escaped
## [1] "Z\\u00FCrich"
stri_unescape_unicode(escaped)
## [1] "Zürich"

answered Oct 09 '22 06:10

bartektartanus

Tags: json, regex, r, unicode, utf-8
