
R - c() unexpectedly converts names of named vectors into UTF-8. Is this a bug?

I've run into strange behavior of c() in R 3.3.2 on Windows with a non-US-English locale: it converts the names of named vectors to UTF-8.

x <- "φ"
names(x) <- "φ"

Encoding(names(x))
#> [1] "unknown"

Encoding(names(c(x)))
#> [1] "UTF-8"

Though this issue is not a problem for most people, it is critical for those who use named vectors as lookup tables (an example is here: http://adv-r.had.co.nz/Subsetting.html#applications). I am also one of those who got stuck on the behavior of dplyr's select() function because of it.

I'm not quite sure whether this behavior is a bug or by design. Should I submit a bug report to R core?

Here's info about my R environment:

sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows >= 8 x64 (build 9200)
#> 
#> locale:
#> [1] LC_COLLATE=Japanese_Japan.932  LC_CTYPE=Japanese_Japan.932    LC_MONETARY=Japanese_Japan.932
#> [4] LC_NUMERIC=C                   LC_TIME=Japanese_Japan.932    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#> [1] tools_3.3.2
yutannihilation asked Dec 03 '16 01:12


1 Answer

You should still see names(c(x)) == names(x) on your system. The encoding change by c() may be unintentional, but shouldn't affect your code in most scenarios.
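As a quick check of that claim, here is a minimal sketch that reuses the x from the question. It only assumes that comparison and subsetting by name match on string content rather than on the declared encoding mark:

```r
x <- "φ"
names(x) <- "φ"

# The encoding mark may differ after c(), but the names still compare equal.
same_names <- names(c(x)) == names(x)

# Subsetting the combined vector by the original name still finds the element.
found <- unname(c(x)[names(x)])
```

If same_names is TRUE and found equals the original value, lookups on the vector itself are not affected by the encoding change.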

On Windows, which doesn't have a UTF-8 locale, your safest bet is to convert all strings to UTF-8 first via enc2utf8(), and then stay in UTF-8. This will also enable safe lookups.
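A minimal sketch of that approach, assuming a lookup table stored as a named character vector. The names lookup and key, and the values in the table, are made up for illustration:

```r
# Hypothetical lookup table; the entries are illustrative only.
lookup <- c("φ" = "phi", "ψ" = "psi")

# Normalise the names to UTF-8 once, up front.
names(lookup) <- enc2utf8(names(lookup))

# Normalise the key the same way before subsetting.
key <- enc2utf8("φ")
value <- unname(lookup[key])
```

Converting both the table names and the key at the boundary means every later comparison happens between strings in the same encoding.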

Language symbols (as used in dplyr's group_by()) are an entirely different issue. For some reason they are always interpreted in the native encoding. (Try as.name(names(c(x))).) However, it's still best to keep the strings in UTF-8 and convert to native just before calling as.name(). This is what dplyr should be doing; we're just not quite there yet.
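The "convert just before as.name()" step might look like the sketch below. The variable names col and sym are hypothetical, and this is not dplyr's actual implementation:

```r
# Keep the column name in UTF-8 while working with it as a string.
col <- enc2utf8("φ")

# Convert to the native encoding only at the point where the symbol is created.
sym <- as.name(enc2native(col))

# The symbol's text should match the native-encoded string it was built from.
round_trip_ok <- as.character(sym) == enc2native(col)
```

Note that enc2native() can only represent characters that exist in the current locale's code page, so this works for "φ" under Japanese_Japan.932 but is not guaranteed for arbitrary characters.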

My recommendation is to use ASCII-only characters for column names when using dplyr on Windows. This requires some discipline if you're relying on tidyr::spread() for non-ASCII column contents. You could also consider switching to a system (OS X or Linux) that works with UTF-8 natively.

krlmlr answered Nov 17 '22 00:11