Converting encoding of deparsed strings

Tags: string, text, r, utf-8

I have the following list:

x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")

I want to convert it to a list in which the escaped Unicode character is replaced by the actual UTF-8 character, like so:

goal <- list("Chamberlain", "Romañach", "<node>")

The deparsed string is causing problems. If the second string was instead:

wouldbenice <- "Roma\u00F1ach"

Then enc2native(wouldbenice) would do the right thing (or lapply(x, enc2native) for the whole list).

I can get the second string to display correctly in UTF-8 with:

# displays "Romañach"
eval(parse(text = x[[2]]))

However, this fails for x[[1]] and x[[3]]: the first is evaluated as an undefined symbol and the third is not valid R syntax, so both throw errors. How can I reliably parse the entire list into the appropriate encoding?

cboettig asked Feb 03 '18 21:02

2 Answers

Use the stringi package.

stri_replace_all_regex removes the literal quotes left over from deparsing, and stri_unescape_unicode converts the escaped Unicode sequences into the characters they represent.

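library(stringi)

x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")

# Strip the literal double quotes left over from deparsing
removed_quotes <- stri_replace_all_regex(x, "\"", "")

# Turn escape sequences such as \u00F1 into the characters they encode
unescaped <- stri_unescape_unicode(removed_quotes)
unescaped
# [1] "Chamberlain" "Romañach"    "<node>"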
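
One caveat with the approach above is that it removes every double quote, including any that are part of the text itself. A minimal sketch of a variant that strips only a leading and a trailing quote is shown here. The helper name unescape_deparsed is hypothetical (it is not part of stringi), and the sketch assumes every list element is a single string.

```r
library(stringi)

# Hypothetical helper: strip only the surrounding quotes, then unescape.
unescape_deparsed <- function(x) {
  # Flatten the list into a character vector (assumes length-1 elements)
  x <- unlist(x, use.names = FALSE)
  # Remove a double quote only at the very start or very end of each string
  stripped <- stri_replace_all_regex(x, '^"|"$', "")
  # Convert escape sequences such as \u00F1 into real characters
  stri_unescape_unicode(stripped)
}

x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
result <- unescape_deparsed(x)
```

The result is a character vector; wrap it in as.list() if a list is needed.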
m0nhawk answered Sep 21 '22 06:09


This satisfies the objective in base R, but seems less than ideal in other ways. Putting it here so readers can compare, though I think the stringi-based solution is probably the way to go.

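utf8me <- function(x) {
  # A literal backslash-u marks a deparsed string (a crude test, not a robust one)
  i <- grepl("\\u", x, fixed = TRUE)
  # Parse and evaluate each flagged element separately, so none is dropped
  x[i] <- vapply(x[i], function(s) eval(parse(text = s)), character(1))
  x
}

lapply(x, utf8me)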
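
For comparison, here is a sketch of a base R variant that avoids eval(parse()) altogether, so no input is ever executed as code. It finds each \uXXXX sequence with a regular expression and converts the hex code point with strtoi() and intToUtf8(). The function name unescape_unicode is hypothetical, and the sketch assumes four-digit escapes and length-1 list elements; it does not combine surrogate pairs.

```r
# Hypothetical helper: decode \uXXXX escapes without evaluating any code.
unescape_unicode <- function(x) {
  # Flatten the list into a character vector (assumes length-1 elements)
  s <- unlist(x, use.names = FALSE)
  # Drop a leading and a trailing double quote left over from deparsing
  s <- gsub('^"|"$', "", s)
  # Locate every literal backslash-u followed by four hex digits
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", s)
  # Replace each match with the character for that code point
  regmatches(s, m) <- lapply(regmatches(s, m), function(esc) {
    vapply(esc, function(e) intToUtf8(strtoi(substring(e, 3), 16L)),
           character(1), USE.NAMES = FALSE)
  })
  s
}

x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
result <- unescape_unicode(x)
```

Strings without an escape sequence pass through unchanged, so there is no need to detect them first.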
cboettig answered Sep 23 '22 06:09