I have the following list:
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
I want to convert it to a list with the Unicode escape sequences replaced by the characters they encode, like so:
goal <- list("Chamberlain", "Romañach", "<node>")
The deparsed string is causing problems. If the second string were instead:
wouldbenice <- "Roma\u00F1ach"
then enc2native(wouldbenice) would do the right thing (or lapply(x, enc2native) for the whole list).
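To see why the deparsed form resists enc2native, compare the two strings (output from my UTF-8 session):
nchar(x[[2]])      # 15: the quotes, backslash, "u", and hex digits are all literal characters
nchar(wouldbenice) #  8: "ñ" is a single character
Since x[[2]] contains no non-ASCII bytes, enc2native(x[[2]]) returns it unchanged; the escape is plain text, not an encoding problem.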
I can get the second string to display correctly in UTF-8 with:
# displays "Romañach"
eval(parse(text = x[[2]]))
However, this goes poorly with x[[1]] and x[[3]]: "<node>" throws a parse error, and "Chamberlain" parses as a variable name and fails at eval. How can I reliably parse the entire list into the appropriate encoding?
Use the stringi package: stri_replace_all_regex to strip the embedded quotes, and stri_unescape_unicode to unescape the Unicode symbols.
library(stringi)
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
# Strip the literal double quotes (stringi coerces the list to a character vector)
removed_quotes <- stri_replace_all_regex(x, "\"", "")
# Decode escape sequences such as \u00F1 into the characters they represent
unescaped <- stri_unescape_unicode(removed_quotes)
unescaped
# [1] "Chamberlain" "Romañach"    "<node>"
This satisfies the objective in base R, but seems less than ideal in other ways. Putting it here so readers can compare, though I think the stringi-based solution is probably the way to go.
utf8me <- function(s){
  # fixed = TRUE matches the literal backslash-u; the original pattern
  # grepl('\\u', s) was interpreted as a regex, which is not robust
  i <- grepl('\\u', s, fixed = TRUE)
  if (any(i)) {
    # Re-parse the deparsed string so R decodes the \uxxxx escape
    s[i] <- eval(parse(text = s[i]))
  }
  s
}
lapply(x, utf8me)
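A quick sanity check against the goal from the question (run in my UTF-8 session; string encoding marks can vary by platform, so treat this as a sketch):
identical(lapply(x, utf8me), goal)
# [1] TRUE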