Why does this code, as.factor(c("\U201C", '"3', "1", "2", "\U00B5")), return a different ordering of factor levels on each operating system?
On Linux:
> as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
[1] " "3 1 2 µ
Levels: µ " 1 2 "3
On Windows:
> as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
[1] " "3 1 2 µ
Levels: "3 " µ 1 2
On Mac OS:
> as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
[1] " "3 1 2 µ
Levels: "3 " 1 2 µ
I had some students submit an R Markdown assignment that contained as.numeric(as.factor(dat$var)). Granted, this is not a good way to code, but the inconsistency in output led to much confusion and wasted time.
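One way to sidestep the problem in an assignment like this is to avoid relying on the default sorted levels at all. The sketch below is only an illustration: the data frame `dat` and its values are made up, and it assumes that level order by first appearance is acceptable for the task. Because `unique()` does not sort, no locale-dependent collation is involved.

```r
# Hypothetical data standing in for the students' dat$var.
dat <- data.frame(var = c("b", "a", "c", "a"), stringsAsFactors = FALSE)

# Pass the levels explicitly, in order of first appearance,
# so factor() never has to sort the strings itself.
f <- factor(dat$var, levels = unique(dat$var))

# Integer codes now depend only on the data, not on the platform.
codes <- as.integer(f)
print(levels(f))
print(codes)
```

With explicit levels, the integer codes should come out the same on every machine that reads the same data.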
It's not just Unicode and not just R; sort in general (even the *nix command sort) can be locale-specific. Setting LC_COLLATE (presumably to "C") via Sys.setlocale (as per @alistaire's comment) on all machines is required to remove the differences.
For me, on Windows (7):
sort(c("Abc", "abc", "_abc", "ABC"))
[1] "_abc" "abc" "Abc" "ABC"
whereas on Linux (Ubuntu 12.04 ... wow, I need to upgrade that machine) I get
sort(c("Abc", "abc", "_abc", "ABC"))
[1] "abc" "_abc" "Abc" "ABC"
Setting the locale as per above via
Sys.setlocale("LC_COLLATE", "C")
gives
sort(c("Abc", "abc", "_abc", "ABC"))
[1] "ABC" "Abc" "_abc" "abc"
on both machines, identically.
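If changing the collation for the whole session feels too invasive, the same idea can be wrapped in a small helper that switches to "C" only for the duration of one sort. This is a minimal sketch, not a vetted utility; the name `sort_c` is made up for illustration.

```r
# Hypothetical helper: sort under the "C" collation, then restore
# whatever LC_COLLATE was in effect before the call.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

before <- Sys.getlocale("LC_COLLATE")
res <- sort_c(c("Abc", "abc", "_abc", "ABC"))
after <- Sys.getlocale("LC_COLLATE")
print(res)
```

The result should match the "C" ordering shown above, and the session's collation setting is left as it was found.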
The *nix man page for sort gives the bold warning:
*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
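R appears to offer a comparable escape hatch that does not touch the locale: as far as I can tell from ?sort, method = "radix" (available since R 3.3.0) orders character vectors in C-locale fashion regardless of LC_COLLATE. Treat the following as a sketch under that assumption, and check the behaviour with non-ASCII strings on your own machines before relying on it.

```r
x <- c("Abc", "abc", "_abc", "ABC")

# Radix sort is documented to use C-locale ordering for character vectors.
radix_sorted <- sort(x, method = "radix")
print(radix_sorted)

# The sorted values can then be handed to factor() as explicit levels.
f <- factor(x, levels = radix_sorted)
print(as.integer(f))
```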
Update: Looks like I can reproduce the issue when Unicode characters are included. The issue traces back to sort; try sorting the vector in your example. I also can't seem to change the locale (LC_COLLATE or LC_CTYPE) to "en_AU.UTF-8", which would be a potential solution.