I have the following list:
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
I want to convert it to a list with the Unicode escape sequences replaced by the characters they encode, like so:
goal <- list("Chamberlain", "Romañach", "<node>")
The deparsed string is causing problems. If the second string were instead:
wouldbenice <- "Roma\u00F1ach"
then enc2native(wouldbenice) would do the right thing (or lapply(x, enc2native) for the whole list).
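To see why the deparsed form resists enc2native, compare the two strings (output from my UTF-8 session):
nchar(x[[2]])      # 15: the quotes, backslash, "u", and hex digits are all literal characters
nchar(wouldbenice) #  8: "ñ" is a single character
Since x[[2]] contains no non-ASCII bytes, enc2native(x[[2]]) returns it unchanged; the escape is plain text, not an encoding problem.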
I can get the second string to display correctly in UTF-8 with:
# displays "Romañach"
eval(parse(text = x[[2]]))
However, this goes poorly with x[[1]] and x[[3]]: "<node>" throws a parse error, and "Chamberlain" parses as a variable name and fails at eval. How can I reliably parse the entire list into the appropriate encoding?
Use the stringi package: stri_replace_all_regex to strip the embedded quotes, and stri_unescape_unicode to unescape the Unicode symbols.
library(stringi)
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
# Strip the literal double quotes (stringi coerces the list to a character vector)
removed_quotes <- stri_replace_all_regex(x, "\"", "")
# Decode escape sequences such as \u00F1 into the characters they represent
unescaped <- stri_unescape_unicode(removed_quotes)
unescaped
# [1] "Chamberlain" "Romañach"    "<node>"
This satisfies the objective in base R, but seems less than ideal in other ways. Putting it here so readers can compare, though I think the stringi-based solution is probably the way to go.
utf8me <- function(s){
  # fixed = TRUE matches the literal backslash-u; the original pattern
  # grepl('\\u', s) was interpreted as a regex, which is not robust
  i <- grepl('\\u', s, fixed = TRUE)
  if (any(i)) {
    # Re-parse the deparsed string so R decodes the \uxxxx escape
    s[i] <- eval(parse(text = s[i]))
  }
  s
}
lapply(x, utf8me)
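A quick sanity check against the goal from the question (run in my UTF-8 session; string encoding marks can vary by platform, so treat this as a sketch):
identical(lapply(x, utf8me), goal)
# [1] TRUE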