Yesterday I spent some time tracking down a bug in my code and found that the data.table package sorts strings a bit differently from base R. Is this normal behavior, and what is the most efficient way (one that keeps the benefits of data.table) to reproduce the results obtained with the base order function? Here is a toy reproducible example:
library(data.table)
options(stringsAsFactors = FALSE)
d <- data.frame(cn=c("USA","Ubuntu","Uzbekistan"))
d[order(d$cn), , drop = FALSE]
# cn
#2 Ubuntu
#1 USA
#3 Uzbekistan
dt <- data.table(d)
setkey(dt, cn)
dt
# cn
#1: USA
#2: Ubuntu
#3: Uzbekistan
options(stringsAsFactors = default.stringsAsFactors())
OS: Windows 7
Update March 2014
There's been some debate about this one. As of v1.9.2 we've settled for now on setkey sorting using the C locale; e.g., all capital letters come before all lower case letters, regardless of the user's locale. This was a change made in v1.8.8 which we had intended to reverse but have stuck with for now.
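To see what C-locale ordering looks like without touching the session locale, here is a minimal base R sketch. It relies on the documented behavior that method = "radix" in sort() and order() collates character vectors in the C locale (available from R 3.3.0); it illustrates the ordering only and is not data.table's own implementation.

```r
x <- c("USA", "Ubuntu", "Uzbekistan")

# method = "radix" collates character vectors in the C locale,
# so the result does not depend on the session's LC_COLLATE setting.
c_sorted <- sort(x, method = "radix")
c_order  <- order(x, method = "radix")

print(c_sorted)  # "USA" "Ubuntu" "Uzbekistan"
print(c_order)   # 1 2 3
```

Comparing this against plain order(x) in your own session is a quick way to check whether your locale's collation differs from the C locale for the data at hand.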
Consider save()-ing a keyed table in your locale and a colleague load()-ing it in a different locale. When they join to that table, the join may no longer work correctly if the key had been sorted in locale order. We have to think a bit more carefully if setkey is to allow locale ordering again, probably by saving the locale name along with the "sorted" attribute, so data.table can at least compare and detect if the current locale is different from the one that ran setkey.
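The idea of storing the locale next to the sort marker can be sketched in base R. This is only an illustration of the proposal, not something data.table does: the helper names sort_with_locale and sorted_in_current_locale and the attribute name "collate_locale" are hypothetical.

```r
# Hypothetical helper: sort a character vector and record which
# collation locale was active when it was sorted.
sort_with_locale <- function(x) {
  out <- sort(x)
  attr(out, "collate_locale") <- Sys.getlocale("LC_COLLATE")
  out
}

# Hypothetical check: TRUE only if the recorded locale matches the
# collation locale of the current session.
sorted_in_current_locale <- function(x) {
  identical(attr(x, "collate_locale"), Sys.getlocale("LC_COLLATE"))
}

s <- sort_with_locale(c("USA", "Ubuntu", "Uzbekistan"))
same_session <- sorted_in_current_locale(s)

# Simulate an object that was sorted under a different locale.
attr(s, "collate_locale") <- "some-other-locale"
other_session <- sorted_in_current_locale(s)
```

With a check like this, a join could refuse to trust the stored order (or re-sort) when the recorded locale does not match the current one.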
It's also for speed reasons, as sorting according to locale is much slower than sorting in the C locale. That said, locale sorting could be implemented as efficiently as possible, and allowing it as an option would be ideal.
Hence, this is now a feature request and further comments are very welcome.
FR#4842 setkey to sort using session's locale not C locale
Nice catch! The call to setkey in turn calls setkeyv, which calls fastorder to "order" the columns/entries, which in turn calls chorder.
chorder in turn calls a C function, Ccountingcharacter.c. Now, here I suppose the problem comes down to "locale".
Let's see what locale I'm using on my Mac.
Sys.getlocale()
# [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
Now let's see how order sorts it:
x <- c("USA", "Ubuntu", "Uzbekistan")
order(x)
# [1] 2 1 3
Now, let's change the "locale" to "C".
Sys.setlocale("LC_ALL", "C")
# [1] "C/C/C/C/C/en_US.UTF-8"
order(x)
# [1] 1 2 3
From ?order:
The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison.
From ?Comparison:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z....
So, basically, order under the "C" locale gives the same order as data.table's setkey. My guess is that the C function called by chorder always runs in the C locale, which compares ASCII values, for which "S" comes before "b".
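The ASCII argument can be checked directly, and the locale switch can be confined to a single call instead of changing it for the whole session. The sketch below uses only base R; with_collate is a hypothetical helper name, not part of base R or data.table.

```r
# Code points of the second characters: "S" sorts before "b" in ASCII.
code_S <- utf8ToInt("S")  # 83
code_b <- utf8ToInt("b")  # 98

# Hypothetical helper: evaluate an expression under a temporary
# LC_COLLATE setting and restore the previous one afterwards.
with_collate <- function(locale, expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", locale)
  expr  # the lazily evaluated argument is forced here, under the new locale
}

x <- c("USA", "Ubuntu", "Uzbekistan")
before  <- Sys.getlocale("LC_COLLATE")
c_order <- with_collate("C", order(x))
after   <- Sys.getlocale("LC_COLLATE")

print(c_order)  # 1 2 3
```

The same helper could be called with the name of another installed locale to see how its collation orders the strings, assuming that locale exists on the machine.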
It's probably important to bring this to @MatthewDowle's attention (if he's not already aware of it). So, I'd suggest that you file this as a bug here (just to be sure).