
Encoding and raw in R

I am not sure whether this is a bug. If I encode a string to UTF-8 before converting it to raw and back again, the characters are no longer the same as the original. I have set the default encoding to "UTF-8" in RStudio.

rawToChar(charToRaw(enc2utf8("vægt")))
[1] "vÃ¦gt"

rawToChar(charToRaw("vægt"))
[1] "vægt"

Here is my sessionInfo()

R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggthemes_2.2.1  TTR_0.23-0      lubridate_1.3.3 tidyr_0.2.0     skm_1.0.2       ggplot2_1.0.1   dplyr_0.4.3    
[8] stringr_1.0.0   dkstat_0.08    

loaded via a namespace (and not attached):
[1] Rcpp_0.12.1      rstudioapi_0.3.1 magrittr_1.5     MASS_7.3-43      munsell_0.4.2    lattice_0.20-33 
[7] colorspace_1.2-6 R6_2.1.1         httr_1.0.0       plyr_1.8.3       xts_0.9-7        tools_3.2.2     
[13] parallel_3.2.2   grid_3.2.2       gtable_0.1.2     DBI_0.3.1        lazyeval_0.1.10  assertthat_0.1  
[19] digest_0.6.8     reshape2_1.4.1   curl_0.9.3       memoise_0.2.1    labeling_0.3     stringi_0.5-5   
[25] scales_0.3.0     jsonlite_0.9.17  zoo_1.7-12       proto_0.3-10    
asked Oct 11 '15 17:10 by KERO


1 Answer

Here's my basic understanding of what's going on.

First some encoding facts:

                  Encoding
character    UTF-8        CP1252
   v         76             76
   æ         c3 a6          e6
   g         67             67
   t         74             74
   Ã         c3 83          c3
   ¦         c2 a6          a6

Now the mechanics:

The Windows machine uses the CP1252 encoding, as can be seen from the LC_CTYPE entry in the sessionInfo() output. So the vægt string in the R script is represented as the bytes 76 e6 67 74, which charToRaw("vægt") confirms. enc2utf8() converts the string to UTF-8, giving 76 c3 a6 67 74, and marks it as UTF-8. charToRaw() then returns only the bytes, so the fact that they represent UTF-8 is lost. Later, rawToChar() turns these bytes back into a string without any encoding mark, so R interprets them in the native encoding, CP1252. Since c3 is Ã and a6 is ¦ in CP1252, we get vÃ¦gt.
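The byte-level steps can be reproduced on any platform with a small sketch like the one below. It builds the CP1252 bytes from hex so that it does not depend on how the script file is saved, and it uses iconv() to simulate what a CP1252 session displays. The variable names are mine, and it assumes the local iconv knows the name "CP1252".

```r
# Bytes of "vægt" in CP1252, written in hex so the example does not depend
# on the encoding of this script file.
cp1252_bytes <- as.raw(c(0x76, 0xe6, 0x67, 0x74))
native <- rawToChar(cp1252_bytes)

# Re-encode CP1252 -> UTF-8: the single byte e6 (æ) becomes c3 a6.
utf8 <- iconv(native, from = "CP1252", to = "UTF-8")
utf8_bytes <- charToRaw(utf8)
print(utf8_bytes)            # 76 c3 a6 67 74

# rawToChar() returns the bytes as a string with no declared encoding.
roundtrip <- rawToChar(utf8_bytes)
print(Encoding(roundtrip))   # "unknown"

# Simulate a CP1252 session reading those bytes: c3 -> Ã, a6 -> ¦.
misread <- iconv(roundtrip, from = "CP1252", to = "UTF-8")
print(charToRaw(misread))    # 76 c3 83 c2 a6 67 74

# One possible repair: declare that the bytes are UTF-8.
fixed <- roundtrip
Encoding(fixed) <- "UTF-8"
print(Encoding(fixed))       # "UTF-8"
```

Under these assumptions, declaring the encoding after rawToChar() restores the mark that charToRaw() dropped, which is why the final string is again treated as UTF-8 rather than as native CP1252 text.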

On Mac and Linux, on the other hand, the default encoding is UTF-8 throughout and the encoding mismatches do not occur. I suspect, however, that the same phenomenon as on Windows could be triggered by explicitly changing/setting the encoding used by R.
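To see which situation a given session is in, something like the following can be run. It is only a quick check using base R and makes no attempt to change the locale.

```r
# Report the character-type locale and whether R considers it UTF-8.
ctype <- Sys.getlocale("LC_CTYPE")
info <- l10n_info()

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 session:", info[["UTF-8"]], "\n")
```

On the Windows setup from the question, the first line would be expected to show the Danish_Denmark.1252 locale from the sessionInfo() output.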

I don't think this is a bug.

answered Oct 06 '22 02:10 by WhiteViking