After reading all about iconv and Encoding, I am still confused.
I am scraping the source of a web page and have a string that looks like this: 'pretty\u003D\u003Ebig' (displayed in the R console as 'pretty\\\u003D\\\u003Ebig'). I want to convert this to the ASCII string, which should be 'pretty=>big'.
More simply, if I set
x <- 'pretty\\u003D\\u003Ebig'
how do I perform a conversion on x to yield pretty=>big?
Any suggestions?
Use a decoder that interprets the '\u00c3' escapes as Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code this is nonsense, but that code point has the right byte representation when encoded again with ISO-8859-1 ('latin-1'), so...
You CAN'T convert from Unicode to ASCII in general. Almost no Unicode character can be expressed in ASCII, and those that can have exactly the same code points in ASCII as in UTF-8, which is probably what you have.
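The point about shared code points can be checked with a small base R sketch. It uses only utf8ToInt from base R; the variable names are illustrative and not part of the original question.

```r
# Code points of the target string: every value is below 128, so it is pure ASCII.
target_codes <- utf8ToInt("pretty=>big")
all(target_codes < 128)
# [1] TRUE

# A character outside ASCII, here U+03A6 (GREEK CAPITAL LETTER PHI), is not.
phi_code <- utf8ToInt("\u03a6")
phi_code < 128
# [1] FALSE
```

So the string in the question needs no encoding conversion at all once the escapes are decoded.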
A Unicode escape sequence is a backslash followed by the letter 'u' followed by four hexadecimal digits (0-9a-fA-F). It matches a character in the target sequence with the value specified by the four digits. For example, "\u0041" matches the target sequence "A" when the ASCII character encoding is used.
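That definition can be applied by hand in base R. The following is only a sketch: unescape_u4 is a hypothetical helper name, and it assumes every escape has exactly four hex digits and ignores surrogate pairs and the longer \UXXXXXXXX form.

```r
# Hypothetical helper: decode literal \uXXXX sequences using base R only.
unescape_u4 <- function(s) {
  # Find every backslash + 'u' + four hex digits in each element of s.
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(esc) {
    # Drop the leading backslash and 'u', then parse the digits as base 16.
    codes <- strtoi(substring(esc, 3), base = 16L)
    # Turn each code point back into a one-character string.
    vapply(codes, intToUtf8, character(1))
  })
  s
}

x <- 'pretty\\u003D\\u003Ebig'
unescape_u4(x)
# [1] "pretty=>big"
```

Unlike a parse-based approach, this never treats the input as R code, which may matter when the scraped text is untrusted.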
How to decrypt a text with a Unicode cipher? To translate a Unicode message, map each numeric code back to its Unicode character. Example: the message 68,67,934,68,8364 is translated number by number: 68 => D, 67 => C, and so on, giving DCΦD€.
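In R, the number-to-character mapping described above is what the base function intToUtf8 does. A minimal sketch, with illustrative variable names:

```r
# The code points from the example message.
codes <- c(68, 67, 934, 68, 8364)

# intToUtf8 collapses the whole vector into a single string by default.
decoded <- intToUtf8(codes)
decoded
# [1] "DCΦD€"   (as printed in a UTF-8 locale)

# With multiple = TRUE it returns one string per code point instead.
intToUtf8(codes, multiple = TRUE)
```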
Use parse, but don't evaluate the results:
x1 <- 'pretty\\u003D\\u003Ebig'
x2 <- parse(text = paste0("'", x1, "'"))
x3 <- x2[[1]]
x3
# [1] "pretty=>big"
is.character(x3)
# [1] TRUE
length(x3)
# [1] 1
With the stringi package:
> x <- 'pretty\\u003D\\u003Ebig'
> stringi::stri_unescape_unicode(x)
[1] "pretty=>big"