Encoding and raw in R

I'm not sure whether this is a bug. If I encode a character string to UTF-8 before converting it to raw and back again, the characters are no longer the same. I have set the default encoding to "UTF-8" in RStudio.

rawToChar(charToRaw(enc2utf8("vægt")))
[1] "vÃ¦gt"

rawToChar(charToRaw("vægt"))
[1] "vægt"

Here is my sessionInfo()

R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggthemes_2.2.1  TTR_0.23-0      lubridate_1.3.3 tidyr_0.2.0     skm_1.0.2       ggplot2_1.0.1   dplyr_0.4.3    
[8] stringr_1.0.0   dkstat_0.08    

loaded via a namespace (and not attached):
[1] Rcpp_0.12.1      rstudioapi_0.3.1 magrittr_1.5     MASS_7.3-43      munsell_0.4.2    lattice_0.20-33 
[7] colorspace_1.2-6 R6_2.1.1         httr_1.0.0       plyr_1.8.3       xts_0.9-7        tools_3.2.2     
[13] parallel_3.2.2   grid_3.2.2       gtable_0.1.2     DBI_0.3.1        lazyeval_0.1.10  assertthat_0.1  
[19] digest_0.6.8     reshape2_1.4.1   curl_0.9.3       memoise_0.2.1    labeling_0.3     stringi_0.5-5   
[25] scales_0.3.0     jsonlite_0.9.17  zoo_1.7-12       proto_0.3-10

asked Oct 11 '15 17:10

KERO

1 Answer

Here's my basic understanding of what's going on.

First some encoding facts:

                  Encoding
character    UTF-8        CP1252
   v         76             76
   æ         c3 a6          e6
   g         67             67
   t         74             74
   Ã         c3 83          c3
   ¦         c2 a6          a6
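
The byte values in this table can be checked from R itself. The following is a minimal sketch, not part of the original answer: `hex_bytes` is a hypothetical helper name, and it assumes the local iconv implementation recognises the encoding name "CP1252" (`iconvlist()` shows what is available). The `\u` escapes keep the script independent of the encoding of the source file.

```r
# Characters from the table, written as Unicode escapes: v æ g t Ã ¦
chars <- c("v", "\u00e6", "g", "t", "\u00c3", "\u00a6")

# Hypothetical helper: hex bytes of each string in the target encoding `to`.
hex_bytes <- function(x, to) {
  # toRaw = TRUE makes iconv() return a list of raw vectors, one per string
  raw_list <- iconv(enc2utf8(x), from = "UTF-8", to = to, toRaw = TRUE)
  vapply(raw_list, function(b) paste(as.character(b), collapse = " "), character(1))
}

byte_table <- data.frame(
  utf8   = hex_bytes(chars, "UTF-8"),
  cp1252 = hex_bytes(chars, "CP1252"),  # assumes "CP1252" is a known name
  stringsAsFactors = FALSE
)
print(byte_table)
```

Each row of `byte_table` corresponds to one row of the table above, in the same order.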

Now the mechanics:

The Windows machine uses the CP1252 encoding, as can be seen from the LC_CTYPE entry in the sessionInfo() output. So the string vægt in the R script is represented as the bytes 76 e6 67 74, which charToRaw("vægt") confirms. Converting it to UTF-8 with enc2utf8() gives 76 c3 a6 67 74. charToRaw() returns only the bytes, so the fact that they represent UTF-8 is lost. rawToChar() then turns these bytes back into a string with no declared encoding, which R interprets in the native encoding, CP1252. Since c3 is Ã and a6 is ¦ in CP1252, we get vÃ¦gt.
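
The loss of the encoding mark can be made visible with `Encoding()`. This is a small sketch under the assumption that the bytes really are UTF-8, as they are after `enc2utf8()`. It is one possible way to restore the string, not something the original post proposes. The variable names are illustrative.

```r
# "\u00e6" is æ; the escape avoids depending on the source file encoding.
x <- enc2utf8("v\u00e6gt")
mark_in <- Encoding(x)        # the string carries a UTF-8 mark

bytes <- charToRaw(x)         # bare bytes: 76 c3 a6 67 74, no encoding information

y <- rawToChar(bytes)
mark_out <- Encoding(y)       # the mark is gone; R will assume the native encoding

# Re-declare the encoding (assumption: we know the bytes are UTF-8).
Encoding(y) <- "UTF-8"
round_trip_ok <- identical(y, x)
```

`Encoding<-` changes only the declared encoding and leaves the bytes untouched, so it is appropriate only when the bytes are already known to be in that encoding.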

On Mac and Linux, on the other hand, the default encoding is UTF-8 throughout and the encoding mismatches do not occur. I suspect, however, that the same phenomenon as on Windows could be triggered by explicitly changing/setting the encoding used by R.
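
The misreading can also be imitated on a UTF-8 system without touching the locale, by explicitly telling `iconv()` to treat the UTF-8 bytes as CP1252. This is only a simulation of what the Windows session does implicitly, and it again assumes that the local iconv knows the name "CP1252".

```r
# UTF-8 bytes of "vægt": 76 c3 a6 67 74
bytes <- charToRaw(enc2utf8("v\u00e6gt"))

# Deliberately misinterpret those bytes as CP1252 and convert the result to UTF-8.
garbled <- iconv(rawToChar(bytes), from = "CP1252", to = "UTF-8")

# Each of c3 and a6 is now its own character (Ã and ¦), two bytes apiece in UTF-8.
garbled_bytes <- charToRaw(garbled)
print(garbled)
```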

I don't think this is a bug.

answered Oct 06 '22 02:10

WhiteViking

Tags:

r

character-encoding
