How data.table sorts strings when setting key

Tags: r, data.table

Yesterday I spent some time tracking down a bug in my code and found that the data.table package sorts strings slightly differently from base R. Is this normal behavior, and what is the most efficient way (one that keeps the benefits of data.table) to reproduce the results of the base order function? Here is a toy reproducible example:

library(data.table)
options(stringsAsFactors = FALSE)

d <- data.frame(cn=c("USA","Ubuntu","Uzbekistan"))
d[order(d$cn), , drop = FALSE]

#          cn
#2     Ubuntu
#1        USA
#3 Uzbekistan

dt <- data.table(d)
setkey(dt, cn)
dt

#           cn
#1:        USA
#2:     Ubuntu
#3: Uzbekistan

options(stringsAsFactors = default.stringsAsFactors())

OS: Windows 7

asked Aug 20 '13 13:08 by jem77bfp


1 Answer

Update March 2014

There's been some debate about this one. As of v1.9.2 we've settled, for now, on setkey sorting in the C locale; that is, all capital letters come before all lowercase letters, regardless of the user's locale. This change was made in v1.8.8. We had intended to reverse it but have stuck with it for now.

Consider save()-ing a keyed table in your locale and a colleague load()-ing it in a different locale. If the key had been sorted in locale order, their joins to that table might no longer work correctly. We have to think more carefully before setkey allows locale ordering again, probably by saving the locale name along with the "sorted" attribute, so that data.table can at least detect when the current locale differs from the one in which setkey was run.
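The idea of storing the locale name next to the sorted data can be sketched in plain R. This is only an illustration of the principle, not how data.table works: the helper names tag_sorted and sorted_in_current_locale and the attribute name collate_locale are made up for this sketch.

```r
# Hypothetical helper: sort a character vector and remember which
# collation locale produced that order.
tag_sorted <- function(x) {
  out <- sort(x)  # uses the session's collation locale for character vectors
  attr(out, "collate_locale") <- Sys.getlocale("LC_COLLATE")
  out
}

# Hypothetical helper: TRUE only if the recorded locale matches the
# collation locale of the current session.
sorted_in_current_locale <- function(x) {
  identical(attr(x, "collate_locale"), Sys.getlocale("LC_COLLATE"))
}

x <- tag_sorted(c("USA", "Ubuntu", "Uzbekistan"))
sorted_in_current_locale(x)
```

A vector tagged in one session and loaded in a session with a different LC_COLLATE would fail the check, which is the point at which a re-sort (or a warning) would be needed.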

Speed is another reason: sorting according to a locale is much slower than sorting in the C locale. That said, we could implement locale-aware sorting as efficiently as possible, and offering it as an option would be ideal.

Hence, this is now a feature request and further comments are very welcome.

FR#4842 setkey to sort using session's locale not C locale
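Until something like that exists, one possible workaround for getting base's locale-dependent order on a data.table is to call base::order() explicitly inside [ ]. This is a sketch under two assumptions: that the data.table package is installed, and that your version optimises a bare order() call inside [ ] to its own C-locale ordering (which, as far as I know, newer versions do), so the base:: prefix is what requests the session locale. The name dt_locale is just for illustration.

```r
library(data.table)

dt <- data.table(cn = c("USA", "Ubuntu", "Uzbekistan"))

# base::order() uses the session's collation locale, exactly as in the
# data.frame example from the question.
dt_locale <- dt[base::order(cn)]

# The result is an ordinary data.table without a key, so it does not get
# the benefits of keyed joins and subsets.
key(dt_locale)
```

The trade-off is that the row order matches base R, but the table is not keyed; setkey would re-sort it in the C locale.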



Nice catch! The call to setkey calls setkeyv, which calls fastorder to "order" the columns/entries, and that in turn calls chorder.

chorder in turn calls a C function, Ccountingcharacter.c. I suppose the difference comes from the locale.

Let's see which locale I'm using on my Mac.

Sys.getlocale()
# [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

Now let's see how order sorts it:

x <- c("USA", "Ubuntu", "Uzbekistan")
order(x)
# [1] 2 1 3

Now, let's change the "locale" to "C".

Sys.setlocale("LC_ALL", "C")
# [1] "C/C/C/C/C/en_US.UTF-8"

order(x)
# [1] 1 2 3

From ?order:

The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison.

From ?Comparison:

Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z....

So, basically, order under the "C" locale gives the same ordering as data.table's setkey. My guess is that the C function called by chorder always sorts as in the C locale, which compares ASCII values, and there "S" (83) comes before "b" (98).
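The byte-value explanation can be checked without touching the session locale. A minimal sketch, assuming R 3.3.0 or later, where order() accepts method = "radix" and that method is documented to collate character vectors in the C locale:

```r
x <- c("USA", "Ubuntu", "Uzbekistan")

# Code points of the two characters that decide "USA" versus "Ubuntu".
utf8ToInt("S")  # 83
utf8ToInt("b")  # 98

# The radix method collates in the C locale, whatever the session locale is,
# so this matches the order produced by setkey.
order(x, method = "radix")
# [1] 1 2 3
```

This avoids calling Sys.setlocale(), which changes the locale for the whole session.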

It's probably important to bring this to @MatthewDowle's attention (if he's not already aware of it). So, I'd suggest that you file this as a bug here (just to be sure).

answered Oct 11 '22 19:10 by Arun