Why does this code, as.factor(c("\U201C", '"3', "1", "2", "\U00B5")), return a different ordering of factor levels on each operating system?
On Linux:
> as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
[1] "  "3 1  2  µ 
Levels: µ " 1 2 "3
On Windows:
> as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
[1] "  "3 1  2  µ 
Levels: "3 " µ 1 2
On Mac OS:
> as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
[1] "  "3 1  2  µ 
Levels: "3 " 1 2 µ
I had some students submit an R Markdown assignment that contained as.numeric(as.factor(dat$var)). Now granted this is not a good way to code, but the inconsistency in output led to much confusion and wasted time.
It's not just Unicode and not just R; sort in general (as in even the *nix command sort) can be locale specific. Setting LC_COLLATE (presumably to "C") via Sys.setlocale (as per @alistaire's comment) on all machines is required to remove the differences.
For me, on Windows (7):
sort(c("Abc", "abc", "_abc", "ABC"))
[1] "_abc" "abc"  "Abc"  "ABC" 
whereas on Linux (Ubuntu 12.04 ... wow, I need to upgrade that machine) I get
sort(c("Abc", "abc", "_abc", "ABC"))
[1] "abc"  "_abc" "Abc"  "ABC" 
Setting the locale as per above via
Sys.setlocale("LC_COLLATE", "C")
gives
sort(c("Abc", "abc", "_abc", "ABC"))
[1] "ABC"  "Abc"  "_abc" "abc" 
on both machines, identically.
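If changing the locale for the whole session feels too heavy-handed, one option is to switch LC_COLLATE only around the sort and then put it back. This is a minimal sketch rather than a definitive recipe; the variable names old_collate and sorted are just illustrative.

```r
# Remember the current collation locale so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch to byte-value ("C") collation for the duration of the sort.
Sys.setlocale("LC_COLLATE", "C")

x <- c("Abc", "abc", "_abc", "ABC")
sorted <- sort(x)
print(sorted)

# Restore whatever collation the session was using before.
Sys.setlocale("LC_COLLATE", old_collate)
```

In the "C" locale, strings are compared by byte value, so uppercase letters sort before the underscore, which sorts before lowercase letters.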
The *nix man page for sort gives the bold warning
*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
Update: It looks like I can reproduce the issue when including Unicode characters. The issue traces back to sort: try sorting the vector in your example. I can't seem to change the locale (LC_COLLATE or LC_CTYPE) to "en_AU.UTF-8" either, which would be a potential solution.
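Another way around this, without touching the locale at all, is to stop as.factor from choosing the level order and pass the levels explicitly. The sketch below assumes that sort(..., method = "radix") is acceptable for your data. According to the documentation for sort, the radix method always orders character vectors as in the C locale, so it should not depend on LC_COLLATE. The names lev and f are just illustrative.

```r
x <- c("\U201C", '"3', "1", "2", "\U00B5")

# Radix sorting of character vectors uses C-locale ordering,
# independent of the session's LC_COLLATE setting.
lev <- sort(unique(x), method = "radix")

# Supply the levels explicitly instead of letting factor() sort them.
f <- factor(x, levels = lev)

print(levels(f))
print(as.integer(f))
```

With the levels fixed this way, the integer codes behind as.numeric(as.factor(...)) style code no longer depend on how each platform collates the strings.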