Why does R 3.6.0 return FALSE when evaluating the expression ("Dogs" < "cats")?

Tags: r, case, ascii

I have some complicated code, but instead of showing you that, I am going to extract the essence of the problem.

Evaluate: "dogs" < "cats" … This should evaluate to FALSE and it does in R 3.6.

Evaluate: "Dogs" < "cats" … This should evaluate to TRUE because the ASCII code for "D" is 68 and the ASCII code for "c" is 99. Since 68 < 99, "Dogs" < "cats" should evaluate to TRUE, but it does not in R 3.6.0. However, when I tried using the Console window on the https://datacamp.com website, the expression "Dogs" < "cats" returned TRUE and the expression "dogs" < "Cats" returned FALSE - as expected.

Hence my question: why does R 3.6.0 return FALSE for ("Dogs" < "cats")?

Dr. Donald Tynes II PE PhD asked Jun 06 '19 22:06


1 Answer

The interpreter at DataCamp shows:

> Sys.getlocale()
[1] "C"

whereas mine, and perhaps yours, shows:

> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

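With the "C" locale, strings are compared by the ASCII values of their characters, so every uppercase letter sorts before every lowercase one. In en_US.UTF-8 the collating sequence interleaves the two cases (aAbBcC and so on), so "cats" sorts before "Dogs" and "Dogs" < "cats" is FALSE.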
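To reproduce the DataCamp result locally, one option is to switch only the collation category to "C" for the duration of the comparison. This is a minimal sketch using base R; the variable names (old_collate, upper_first, lower_first) are illustrative and not from the question.

```r
# Remember the current collation setting so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch string comparison to the "C" locale (ASCII/byte order).
invisible(Sys.setlocale("LC_COLLATE", "C"))

upper_first <- "Dogs" < "cats"  # "D" is 68, "c" is 99, so this is TRUE here
lower_first <- "dogs" < "Cats"  # "d" is 100, "C" is 67, so this is FALSE here
print(c(upper_first, lower_first))

# Restore the original collation so the rest of the session is unaffected.
invisible(Sys.setlocale("LC_COLLATE", old_collate))
```

Changing the locale affects every later comparison and sort in the session, which is why the sketch restores the previous value at the end.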

As mentioned in the comments, this is explained further in the documentation for relational operators:

Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z. In Welsh ng may or may not be a single sorting unit: if it is it follows g. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode code-point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.
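If the goal is an ordering that does not depend on the session locale at all, base R offers a couple of tools. The sketch below inspects the code points with utf8ToInt() and sorts with method = "radix", which is documented to use C-locale ordering for character vectors. The animals vector is made up for illustration.

```r
# Code points of the leading characters; these drive C-locale ordering.
first_codes <- c(D = utf8ToInt("D"), c = utf8ToInt("c"))
print(first_codes)

# Hypothetical sample data mixing upper- and lowercase initials.
animals <- c("cats", "Dogs", "dogs", "Cats")

# method = "radix" sorts character vectors in C-locale order,
# whatever collating sequence the current session uses.
c_order <- sort(animals, method = "radix")
print(c_order)
```

With this approach, uppercase initials come first ("Cats", "Dogs", "cats", "dogs") without touching the locale settings.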

C. Braun answered Nov 13 '22 16:11