I'm investigating how character encoding affects sorting. My question is:
How can I change a single column of a data frame to a different character encoding?
For context, I have included several extra steps below.
1) Create the data frame:
d.enc <- data.frame(utf8  = c(" ", "_ ", " _"),
                    mac   = c(" ", "_ ", " _"),
                    label = c("space", "underscore space", "space underscore"))
2) Convert to character vectors and attempt to set encoding:
d.enc2 <- d.enc  # start from a copy so that d.enc2 exists before its columns are assigned
d.enc2$utf8 <- as.character(d.enc$utf8)
d.enc2$mac <- as.character(d.enc$mac)
d.enc2$label <- as.character(d.enc$label)
Encoding(d.enc2$utf8) <- "UTF-8"
Encoding(d.enc2$mac) <- "MACINTOSH"
Encoding(d.enc2$utf8)
# [1] "unknown" "unknown" "unknown"
Encoding(d.enc2$mac)
# [1] "unknown" "unknown" "unknown"
3) That's not what I was hoping for. I would have expected:
# [1] "UTF-8" "UTF-8" "UTF-8" and
# [1] "MACINTOSH" "MACINTOSH" "MACINTOSH"
4) Are my desired encodings supported? (I am running on a Mac.)
temp <- iconvlist()
temp[399]
# [1] "UTF-8"
temp[338]
# [1] "MACINTOSH"
It seems that they are supported.
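The index positions in iconvlist() differ between platforms and iconv builds, so looking the names up directly may be more portable. A minimal sketch, assuming the same two encoding names as above (the variable names `wanted` and `supported` are only illustrative):

```r
# Encoding names to look for (taken from the steps above)
wanted <- c("UTF-8", "MACINTOSH")

# Compare case-insensitively, since iconv builds differ in how they spell names
supported <- toupper(wanted) %in% toupper(iconvlist())
names(supported) <- wanted
supported
```

Note that being listed by iconvlist() only means iconv() can convert to or from that encoding. Encoding() itself only knows the marks "latin1", "UTF-8", "bytes" and "unknown".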
5) Once I can change the encodings, I would like to do the following to see how the sorting order changes:
library(dplyr)
arrange(d.enc2, desc(utf8))
arrange(d.enc2, desc(mac))
6) I expect the output will look something like this, but in a different order depending on which column is used for the sorting:
  utf8 mac            label
1   _   _  underscore space
2    _   _ space underscore
3                     space
Thanks for any tips!
This may be late, but I saw this in the question "R - Changing encoding of column in dataframe?":
for (col in colnames(mydataframe)) {
  Encoding(mydataframe[[col]]) <- "UTF-8"
}
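Two details from the R documentation probably explain the "unknown" results in the question: Encoding() only marks strings as "latin1", "UTF-8" or "bytes", and pure ASCII strings are always reported as "unknown". To actually re-encode the bytes of a single column, iconv() is the usual tool. Here is a minimal sketch, assuming a hypothetical data frame `df` with one non-ASCII value (neither `df` nor its columns come from the question):

```r
# Hypothetical data: the second value contains a non-ASCII character (e with acute accent)
df <- data.frame(x = c("abc", "caf\u00e9"), stringsAsFactors = FALSE)

# ASCII-only elements are never marked, so only the second one reports an encoding
enc_before <- Encoding(df$x)

# Re-encode a single column from UTF-8 to Latin-1; the original column is left alone
df$x_latin1 <- iconv(df$x, from = "UTF-8", to = "latin1")
enc_after <- Encoding(df$x_latin1)

# The underlying bytes differ: the accent takes two bytes in UTF-8, one in Latin-1
bytes_utf8   <- charToRaw(df$x[2])
bytes_latin1 <- charToRaw(df$x_latin1[2])

# Sort order comes from the collation locale, not from the encoding mark;
# method = "radix" always sorts as in the C locale (byte order)
sorted_c <- sort(c(" ", "_ ", " _"), method = "radix")
```

Converting to another encoding such as "MACINTOSH" should work the same way on systems where iconvlist() lists it, although the result cannot carry that name as a mark. Since the strings in the question are plain ASCII, their bytes are identical in all of these encodings, so any difference in sort order would have to come from the locale used for collation.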