
Converting a \u escaped Unicode string to ASCII

After reading all about iconv and Encoding, I am still confused.

I am scraping the source of a web page and have a string that looks like this: 'pretty\u003D\u003Ebig' (displayed in the R console as 'pretty\\u003D\\u003Ebig'). I want to convert this to the ASCII string, which should be 'pretty=>big'.

More simply, if I set

x <- 'pretty\\u003D\\u003Ebig'

How do I perform a conversion on x to yield pretty=>big?

Any suggestions?

seancarmody asked Jul 20 '13 11:07


2 Answers

Use parse, but don't evaluate the results:

x1 <- 'pretty\\u003D\\u003Ebig'
x2 <- parse(text = paste0("'", x1, "'"))
x3 <- x2[[1]]
x3
# [1] "pretty=>big"
is.character(x3)
# [1] TRUE
length(x3)
# [1] 1
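
The parse approach above handles one string at a time. A minimal sketch of a vectorised wrapper follows; the name unescape_via_parse is hypothetical, and it assumes the input contains no unescaped single quotes, since those would end the string literal early and make parse fail.

```r
# Hypothetical helper: apply the parse-without-eval trick to each element.
# Assumption: no element contains an unescaped single quote.
unescape_via_parse <- function(x) {
  vapply(
    x,
    function(s) {
      # Wrap the text in quotes so the parser reads it as a string literal;
      # the parser resolves the \uXXXX escapes and nothing is evaluated.
      parse(text = paste0("'", s, "'"))[[1]]
    },
    character(1),
    USE.NAMES = FALSE
  )
}

unescape_via_parse(c("pretty\\u003D\\u003Ebig", "a\\u0026b"))
# [1] "pretty=>big" "a&b"
```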
hadley answered Sep 27 '22 18:09


With the stringi package:

> x <- 'pretty\\u003D\\u003Ebig'
> stringi::stri_unescape_unicode(x)
[1] "pretty=>big"
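
If installing stringi is not an option, the same conversion can be sketched in base R with a regular expression. This is a different technique from stri_unescape_unicode, not a reimplementation of it: the function name unescape_base is hypothetical, it only handles four-digit \uXXXX escapes, and it does not combine surrogate pairs.

```r
# Hypothetical base-R sketch: replace each \uXXXX escape with its character.
# Assumptions: only four-hex-digit escapes; surrogate pairs are not combined.
unescape_base <- function(x) {
  # Locate every literal backslash + "u" + four hex digits
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    vapply(
      esc,
      # Drop the leading "\u", read the hex digits, convert to a character
      function(e) intToUtf8(strtoi(substring(e, 3), 16L)),
      character(1),
      USE.NAMES = FALSE
    )
  })
  x
}

unescape_base("pretty\\u003D\\u003Ebig")
# [1] "pretty=>big"
```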
Stéphane Laurent answered Sep 27 '22 18:09