
What are the R sorting rules of character vectors?

R sorts character vectors in an order that I would describe as alphabetic rather than ASCII.

For example:

sort(c("dog", "Cat", "Dog", "cat"))
[1] "cat" "Cat" "dog" "Dog"

Three questions:

  1. What is the technically correct terminology to describe this sort order?
  2. I cannot find any reference to this in the manuals on CRAN. Where can I find a description of the sorting rules in R?
  3. Is this any different from the sorting behaviour of other languages like C, Java, Perl or PHP?
Andrie asked Aug 29 '11 11:08



1 Answer

The Details section of help(sort) states:

 The sort order for character vectors will depend on the collating sequence of the locale in use: see ‘Comparison’. The sort order for factors is the order of their levels (which is particularly appropriate for ordered factors).

and help(Comparison) then shows:

 Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see ‘locales’. The collating sequence of locales such as ‘en_US’ is normally different from ‘C’ (which should use ASCII) and can be surprising. Beware of making _any_ assumptions about the collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’, and collation is not necessarily character-by-character - in Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’ may or may not be a single sorting unit: if it is it follows ‘g’. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.

So the sort order depends on your locale setting.
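
A minimal sketch of how to see this for yourself, using the vector from the question. It assumes the ‘C’ locale can be selected for LC_COLLATE on your platform; the names x, old and c_sorted are only illustrative. The result of the plain sort(x) call is deliberately not shown, because it varies with the session's locale.

```r
x <- c("dog", "Cat", "Dog", "cat")

# Show which collating sequence the current session uses.
Sys.getlocale("LC_COLLATE")

# Sort under the current locale; the result depends on the setting above.
sort(x)

# Temporarily switch collation to the "C" locale, which compares by byte
# value, so uppercase ASCII letters sort before lowercase ones.
old <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")
c_sorted <- sort(x)
Sys.setlocale("LC_COLLATE", old)  # restore the previous setting

c_sorted
# [1] "Cat" "Dog" "cat" "dog"
```

Restoring the saved LC_COLLATE value afterwards keeps the change from leaking into the rest of the session.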

Dirk Eddelbuettel answered Sep 21 '22 15:09