
Language dependent sorting with R

1) How to sort correctly?

The task is to sort abbreviated US state names according to the English alphabet. However, I noticed that R sorts according to the operating system's language or regional settings. For example, in my language (Lithuanian) even the order of the plain Latin letters differs from their order in the English alphabet. Compare the order of these letters (leaving out the Lithuanian-specific ones) in both alphabets:

"ABCDEFGHI Y JKLMNOPRSTUVZ"

sort(LETTERS)
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "Y" "J" "K" "L" "M" "N"
[16] "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Z"

vs.

"ABCDEFGHIJKLMNOPQRSTUVWX Y Z"

LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
[16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

As a result, the sorted state abbreviations also come out in a different order (notice the last two; they should be "WV" and then "WY"):

sort(state.abb)
 [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA"
[13] "ID" "IL" "IN" "KY" "KS" "LA" "MA" "MD" "ME" "MI" "MN" "MO"
[25] "MS" "MT" "NC" "ND" "NE" "NH" "NY" "NJ" "NM" "NV" "OH" "OK"
[37] "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA" "VT" "WA" "WI"
[49] "WY" "WV"

I tried Sys.setlocale("LC_TIME","English_United States.1252"). It helped to get English names of weekdays in plots, graphs and figures.

Now I need help sorting the "English" way.

2) What are the other important language-dependent settings in R a beginner R user should pay attention to?

If you know of other places where R behaves language-dependently, and how to deal with them, please list them.

GegznaV asked Aug 02 '15 13:08


2 Answers

LC_TIME only controls the language used for dates and times. Sort order (collation) is governed by LC_COLLATE. For your purposes, LC_ALL, which sets every category at once, should do the trick:

Sys.setlocale('LC_ALL', 'English_United States.1252')
sort(letters)

However, beware that these settings are operating system specific. The above would, for instance, not work on a typical Unix system. There, the string 'en_US.UTF-8' is generally a good setting, but under Windows that itself may pose problems, as R’s Unicode support is sketchy on Windows.
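A locale name that R accepts on every platform is "C". As a minimal sketch (not part of the answer above), the helper below switches only LC_COLLATE to "C" for the duration of one sort and then restores the previous setting. The name sort_c is hypothetical. The sketch assumes plain ASCII input such as state.abb; in the C locale strings are compared by character code, so all uppercase letters sort before all lowercase ones.

```r
# Hypothetical helper: sort in the "C" collation locale, then restore
# whatever LC_COLLATE setting the session had before.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

sorted_abb <- sort_c(state.abb)
tail(sorted_abb, 2)

# Base R alternative: the radix method always orders character
# vectors as in the C locale, whatever the session locale is.
sorted_radix <- sort(state.abb, method = "radix")
tail(sorted_radix, 2)
```

Both calls should end with "WV" followed by "WY", because "V" has a lower character code than "Y". For text with accented or mixed-case letters the C locale is usually not what a human reader expects, so this only suits simple ASCII keys.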

Konrad Rudolph answered Oct 14 '22 03:10


I am not familiar with R, but it seems to have the same problem as many other programming languages: the lack of native Unicode support in the standard library. By "Unicode support" I mean chapter 3 of the Unicode Standard (http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf), the annexes to the Unicode Standard (especially the one that deals with collation, http://unicode.org/reports/tr10/) and up-to-date versions of CLDR (http://cldr.unicode.org/). Essentially, sorting rules are ambiguous and cannot be standardized without picking one "true" method and neglecting cultural differences. This has been partially mitigated by allowing multiple collation levels that ignore certain details (such as diacritic marks), by defining a case-folding algorithm (in some cases toLower(toUpper(str)) != toLower(str)), and by defining collation rules through the CLDR database, but the problem remains. There are also issues such as context-dependent comparison (http://unicode.org/reports/tr10/#Contextual_Sensitivity), so you need a mature solution that conforms to the Unicode Standard if you want a 'correct' string comparison.

There is a well-known library called ICU (International Components for Unicode) which implements more features of the Unicode Standard than other libraries out there. It has implementations in C/C++ and Java (all open source, with a BSD-like license), and there are bindings to the C version for other languages, including R (https://cran.r-project.org/web/packages/stringi/, http://site.icu-project.org/related). So you could use the 'stringi' package for your text processing, with ICU locales and collation facilities.

Update: To use the ICU collation methods you need ICU4C (how to get it varies across OSes) and then the package for R:

install.packages('stringi')

Then load it:

library(stringi)

after which you can use the comparison functions (http://docs.rexamine.com/R-man/stringi/stri_compare.html). Additional parameters given at the end of these functions are passed to the collator being created (http://docs.rexamine.com/R-man/stringi/stri_opts_collator.html) and affect how the comparison is performed.

stri_cmp_lt("WV", "WY", locale="lt_LT")
stri_cmp_lt("WV", "WY", locale="en_US")
stri_compare("WV", "WV", locale="en_US", strength=1)

For example, the 'strength' parameter above sets the so-called 'collation level' (http://unicode.org/reports/tr10/#Notation). The locale is given by language and country codes as specified here (http://userguide.icu-project.org/locale). You could use these functions to implement a custom sorting function (such as a quicksort that uses them for comparison), because the built-in functions do not seem to provide any way to change the ordering predicate.
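To make the effect of the collation level concrete, here is a small sketch that assumes the stringi package is installed. It compares the same two strings at primary strength (1), where only base letters count, and at tertiary strength (3), where case differences count as well. The variable names are illustrative only.

```r
library(stringi)

# Primary strength: case and accent differences are ignored,
# so "wv" and "WV" are treated as equal.
primary_equal <- stri_cmp_eq("wv", "WV", locale = "en_US", strength = 1)

# Tertiary strength: case differences are significant,
# so the same two strings are no longer equal.
tertiary_equal <- stri_cmp_eq("wv", "WV", locale = "en_US", strength = 3)

primary_equal
tertiary_equal
```

A lower strength is useful when keys should match regardless of capitalization; a higher strength gives a stricter, fully distinguishing order.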

Update 2: Even better than implementing your own sort, just use the stri_sort function, which lets you specify a custom ICU collator (http://docs.rexamine.com/R-man/stringi/stri_order.html):

stri_sort(state.abb, locale="en_US")
 [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN"
[16] "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE" "NH"
[31] "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA"
[46] "VT" "WA" "WI" "WV" "WY"

stri_sort(state.abb, locale="lt_LT")
 [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN"
[16] "KY" "KS" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE" "NH"
[31] "NY" "NJ" "NM" "NV" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA"
[46] "VT" "WA" "WI" "WY" "WV"

Notice that WV and WY are in different positions for different locales now.
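If the abbreviations are one column of a larger table, the companion function stri_order (documented on the same page as stri_sort) returns the sorting permutation instead of the sorted values. The following is a sketch that assumes stringi is installed; the data frame and its column names are made up for illustration from R's built-in state.abb and state.name vectors.

```r
library(stringi)

# Illustrative table built from R's built-in state datasets.
states <- data.frame(abb = state.abb, name = state.name,
                     stringsAsFactors = FALSE)

# Permutation that sorts the abbreviations under English collation rules.
ord <- stri_order(states$abb, locale = "en_US")

# Reorder whole rows so each name stays with its abbreviation.
states_sorted <- states[ord, ]
tail(states_sorted, 2)
```

Because the locale is passed explicitly, the result should not depend on the session's LC_COLLATE setting.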

Dmitrii S. answered Oct 14 '22 04:10