
Why does as.factor() on unicode strings return different results for every operating system?

Tags:

r

unicode

Why does this code: as.factor(c("\U201C", '"3', "1", "2", "\U00B5")), return different orderings of factor levels on every operating system?

On Linux:

> as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
[1] “ "3 1 2 µ
Levels: µ “ 1 2 "3

On Windows:

> as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
[1] “ "3 1 2 µ
Levels: "3 “ µ 1 2

On Mac OS:

> as.factor(c("\U201C",'"3', "1", "2","\U00B5"))
[1] “ "3 1 2 µ
Levels: "3 “ 1 2 µ

I had some students submit an R Markdown assignment that contained as.numeric(as.factor(dat$var)). Granted, this is not a good way to code, but the inconsistency in output led to much confusion and wasted time.

MilesMcBain asked Sep 06 '16 01:09


1 Answer

It's not just Unicode and not just R; sorting in general (even the *nix command sort) can be locale-specific. The factor levels are produced by sorting the unique values, so they inherit the collation order of whatever locale each machine runs in. To remove the differences, set LC_COLLATE (presumably to "C") via Sys.setlocale (as per @alistaire's comment) on all machines.

For me, on Windows (7):

sort(c("Abc", "abc", "_abc", "ABC"))
[1] "_abc" "abc"  "Abc"  "ABC" 

whereas on Linux (Ubuntu 12.04 ... wow, I need to upgrade that machine) I get

sort(c("Abc", "abc", "_abc", "ABC"))
[1] "abc"  "_abc" "Abc"  "ABC" 

Setting the locale as per above via

Sys.setlocale("LC_COLLATE", "C")

gives

sort(c("Abc", "abc", "_abc", "ABC"))
[1] "ABC"  "Abc"  "_abc" "abc" 

on both machines, identically.
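If changing the collation for the whole session is undesirable, the same idea can be scoped to a single call. The following is only a minimal sketch: with_c_collation is a hypothetical helper name (not part of base R) that switches LC_COLLATE to "C", evaluates the expression, and restores the previous setting on exit.

```r
# Hypothetical helper (not part of base R): evaluate `expr` with
# LC_COLLATE = "C", then restore the collation the session had before.
with_c_collation <- function(expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  force(expr)  # the argument is evaluated lazily, i.e. after the switch
}

x <- c("Abc", "abc", "_abc", "ABC")
res <- with_c_collation(sort(x))
print(res)
```

This should print the same C-locale ordering shown above while leaving the session's own collation untouched afterwards.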

The *nix man page for sort gives the bold warning

   *** WARNING *** The locale specified by the  environment  affects  sort
   order.  Set LC_ALL=C to get the traditional sort order that uses native
   byte values.

Update: It looks like I can reproduce the issue when including Unicode characters. The issue traces back to sort: try sorting the vector in your example. I also can't seem to change the locale (LC_COLLATE or LC_CTYPE) to "en_AU.UTF-8", which would have been a potential solution.
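When a particular locale cannot be set, another option is to avoid locale-dependent sorting altogether. The sketch below shows two ways this might be done, neither of which comes from the original question: passing explicit levels to factor() so that no sorting happens, and using sort(method = "radix"), which the sort documentation describes as always ordering character vectors in the C locale. The variable names are illustrative only.

```r
x <- c("\U201C", '"3', "1", "2", "\U00B5")

# Explicit levels in order of first appearance: nothing is sorted,
# so the integer codes cannot depend on LC_COLLATE.
f <- factor(x, levels = unique(x))
codes <- as.integer(f)
print(codes)

# method = "radix" orders character vectors as in the C locale,
# whatever the session's collation setting is.
ascii <- c("Abc", "abc", "_abc", "ABC")
sorted_ascii <- sort(ascii, method = "radix")
print(sorted_ascii)
```

With explicit levels, as.integer(f) simply numbers the values in the order they first appear, which is the same on every operating system.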

Jonathan Carroll answered Oct 02 '22 13:10