I'm not sure whether this is a bug. If I encode a string to UTF-8 before converting it to raw and back again, the characters are no longer the same. I have set the default encoding to "UTF-8" in RStudio.
rawToChar(charToRaw(enc2utf8("vægt")))
[1] "vÃ¦gt"
rawToChar(charToRaw("vægt"))
[1] "vægt"
Here is my sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C LC_TIME=Danish_Denmark.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggthemes_2.2.1 TTR_0.23-0 lubridate_1.3.3 tidyr_0.2.0 skm_1.0.2 ggplot2_1.0.1 dplyr_0.4.3
[8] stringr_1.0.0 dkstat_0.08
loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 rstudioapi_0.3.1 magrittr_1.5 MASS_7.3-43 munsell_0.4.2 lattice_0.20-33
[7] colorspace_1.2-6 R6_2.1.1 httr_1.0.0 plyr_1.8.3 xts_0.9-7 tools_3.2.2
[13] parallel_3.2.2 grid_3.2.2 gtable_0.1.2 DBI_0.3.1 lazyeval_0.1.10 assertthat_0.1
[19] digest_0.6.8 reshape2_1.4.1 curl_0.9.3 memoise_0.2.1 labeling_0.3 stringi_0.5-5
[25] scales_0.3.0 jsonlite_0.9.17 zoo_1.7-12 proto_0.3-10
If in doubt about which encoding to use, use UTF-8, as it can encode any Unicode character.
To detect the encoding of strings, use the detect_str_enc() function. It is vectorized, accepts a character vector, and skips missing values. Strings in R can only be in one of three encodings: UTF-8, Latin-1, or native.
"Natively encoded" strings are strings written in whatever code page the user is using. That is, they are byte values that are translated to the appropriate glyphs based on that code page. This assumes the file was saved in that code page and not as a UTF-8 file.
Here's my basic understanding of what's going on.
First some encoding facts:
Character   UTF-8 bytes (hex)   CP1252 bytes (hex)
v           76                  76
æ           c3 a6               e6
g           67                  67
t           74                  74
Ã           c3 83               c3
¦           c2 a6               a6
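The byte values in the table can be checked from R itself. This is only a sketch: it assumes the platform's iconv knows the name "CP1252", and the names chars, to_hex and encoding_table are made up for illustration. The characters are written with \u escapes so the result does not depend on how the script file was saved.

```r
# Characters from the table, written as \u escapes (v, ae, g, t, A-tilde, broken bar).
chars <- c("v", "\u00e6", "g", "t", "\u00c3", "\u00a6")

# Raw bytes of each character in UTF-8 and in CP1252.
utf8_bytes   <- lapply(enc2utf8(chars), charToRaw)
cp1252_bytes <- iconv(chars, from = "UTF-8", to = "CP1252", toRaw = TRUE)

# Hypothetical helper: format a raw vector as space-separated hex.
to_hex <- function(bytes) paste(as.character(bytes), collapse = " ")

encoding_table <- data.frame(
  character = chars,
  UTF8      = vapply(utf8_bytes, to_hex, character(1), USE.NAMES = FALSE),
  CP1252    = vapply(cp1252_bytes, to_hex, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(encoding_table)
```

How the character column is displayed depends on the console's encoding, but the hex columns should match the table on any platform.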
Now the mechanics:
The Windows machine uses the CP1252 encoding, as can be seen from the sessionInfo() output. So the string vægt in the R script is represented as the bytes 76 e6 67 74, which charToRaw("vægt") confirms. If we then convert it to UTF-8, we get 76 c3 a6 67 74. charToRaw() returns bare bytes, so the fact that these bytes represent UTF-8 is lost. Later, rawToChar() converts these bytes back to a string, again assuming CP1252. Since c3 is Ã and a6 is ¦ in CP1252, we get vÃ¦gt.
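These steps can be traced byte by byte on any platform. The sketch below assumes iconv accepts "CP1252" as an encoding name, and it uses iconv() to stand in for what a CP1252 session does implicitly when it interprets the bytes. The variable names are only illustrative.

```r
# The string from the question, written with a \u escape so R stores it as UTF-8.
x <- "v\u00e6gt"

# Step 1: take the UTF-8 bytes. The raw vector carries no encoding information.
utf8_raw <- charToRaw(enc2utf8(x))
print(utf8_raw)   # bytes: 76 c3 a6 67 74

# Step 2: read those same bytes as if they were CP1252 text.
# c3 becomes A-tilde (U+00C3) and a6 becomes the broken bar (U+00A6).
decoded_as_cp1252 <- iconv(rawToChar(utf8_raw), from = "CP1252", to = "UTF-8")

# Step 3: inspect the garbled string's own UTF-8 bytes.
print(charToRaw(decoded_as_cp1252))   # bytes: 76 c3 83 c2 a6 67 74
```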
On Mac and Linux, on the other hand, the default encoding is UTF-8 throughout and the encoding mismatches do not occur. I suspect, however, that the same phenomenon as on Windows could be triggered by explicitly changing/setting the encoding used by R.
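One way to test that suspicion is to declare the wrong encoding by hand with Encoding<-, which should give the same garbling even in a UTF-8 session. This is a sketch of the idea rather than a statement about what RStudio does internally. It also shows that declaring the bytes as UTF-8 after rawToChar() keeps the round trip intact.

```r
# UTF-8 bytes of the string, with the encoding mark stripped by charToRaw().
utf8_bytes <- charToRaw("v\u00e6gt")

# Wrong declaration: claim the bytes are Latin-1 (treated as CP1252 on Windows).
wrong <- rawToChar(utf8_bytes)
Encoding(wrong) <- "latin1"
wrong_utf8 <- enc2utf8(wrong)   # garbled: v, A-tilde, broken bar, g, t

# Correct declaration: tell R the bytes are UTF-8.
right <- rawToChar(utf8_bytes)
Encoding(right) <- "UTF-8"

print(identical(right, "v\u00e6gt"))
print(identical(wrong_utf8, "v\u00c3\u00a6gt"))
```

So in the original problem, setting Encoding() to "UTF-8" on the result of rawToChar() should be enough to get the expected string back, assuming the raw bytes really are UTF-8.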
I don't think this is a bug.