
Why did R's sorting change data imported with load() after an upgrade from 3.5.2 to 4.0.0?

Tags: sorting, r

Short version: I load() data in a package. A test in that package used to pass; now it fails because the output of sort() changed. Here is a minimal reproducible example (see below for details):

y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# OLD 3.5.2 [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"        
# NEW 4.0.0 [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital" 
# Update 4.0.2 see comment:
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"     

# From jay.sf's comment
sort.int(y, method="radix")
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"  
sort.int(y, method="shell")
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"  

# From Henrik's comment:
data.table::fsort(y)
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"  

The only related reported change I found is

CHANGES IN R 4.0.0
NEW FEATURES
...
When loading data sets via read.table(), data() now uses LC_COLLATE=C to ensure locale-independent results for possible string-to-factor conversions.

But I am not even sure whether this could explain what I see. As I want to minimize the number of imported packages and would like to understand what is going on, I am not sure how to proceed. Am I missing something? (Switching to sort.int with method = "radix" would do the job, but still: why did it change? Is that really better?)
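One way to make such a test independent of both the R version and the session locale is to request the radix method explicitly, since ?sort documents that radix collation always follows the C locale. A minimal sketch; the helper name sort_c is hypothetical and not part of base R:

```r
# Hypothetical helper: sort character vectors in C-locale (byte) order,
# regardless of the session's LC_COLLATE setting.
sort_c <- function(x) sort(x, method = "radix")

y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort_c(y)
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"

# The collation the session would otherwise use for sort(y):
Sys.getlocale("LC_COLLATE")
```

The trade-off is that upper-case letters sort before all lower-case letters, which may not be the order a human reader expects.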

I just realized (thanks to Roland) that sort, via sort.default, calls sort.int in my case:

function (x, decreasing = FALSE, na.last = NA, ...) 
{
  if (is.object(x)) 
    x[order(x, na.last = na.last, decreasing = decreasing)]
  else sort.int(x, na.last = na.last, decreasing = decreasing, 
    ...)
}
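Which branch of the function shown above applies can be checked with is.object(). A small sketch: a plain character vector carries no class attribute, whereas wrapping it in I() adds the class "AsIs":

```r
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")

is.object(y)     # FALSE: the sort.int() branch is taken
is.object(I(y))  # TRUE: the x[order(x, ...)] branch is taken
class(I(y))      # "AsIs"
```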

From ?sort.int:

The "auto" method selects "radix" for short (less than 2^31 elements) numeric vectors, integer vectors, logical vectors and factors; otherwise, "shell".

And according to the docs, sort.int did not change from 4.0.0 to 4.0.2.

From ?data.table::setorder

data.table always reorders in "C-locale". As a consequence, the ordering may be different to that obtained by base::order. In English locales, for example, sorting is case-sensitive in C-locale. Thus, sorting c("c", "a", "B") returns c("B", "a", "c") in data.table but c("a", "B", "c") in base::order. Note this makes no difference in most cases of data; both return identical results on ids where only upper-case or lower-case letters are present ("AB123" < "AC234" is true in both), or on country names and other proper nouns which are consistently capitalized. For example, neither "America" < "Brazil" nor "america" < "brazil" are affected since the first letter is consistently capitalized.

Using C-locale makes the behaviour of sorting in data.table more consistent across sessions and locales. The behaviour of base::order depends on assumptions about the locale of the R session. In English locales, "america" < "BRAZIL" is true by default but false if you either type Sys.setlocale(locale="C") or the R session has been started in a C locale for you – which can happen on servers/services since the locale comes from the environment the R session was started in. By contrast, "america" < "BRAZIL" is always FALSE in data.table regardless of the way your R session was started.
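The examples from the quoted documentation can be reproduced without data.table. In this sketch, base R's radix method (documented in ?sort as always collating in the C locale) is used as a stand-in for data.table's ordering; that substitution is an assumption for illustration, not data.table's own code path:

```r
x <- c("c", "a", "B")

# C-locale order: all upper-case letters sort before lower-case ones
sort(x, method = "radix")
# [1] "B" "a" "c"

# Locale-dependent order: the result depends on the session's LC_COLLATE
sort(x)

# "america" < "BRAZIL" is false in C-locale order, so "BRAZIL" comes first
sort(c("america", "BRAZIL"), method = "radix")
# [1] "BRAZIL"  "america"
```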

(Related questions: "Language dependent sorting with R" and "Best practice: Should I try to change to UTF-8 as locale or is it safe to leave it as is?")


Details

R.version # old              _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          5.2                         
year           2018                        
month          12                          
day            20                          
svn rev        75870                       
language       R                           
version.string R version 3.5.2 (2018-12-20)
nickname       Eggshell Igloo 

y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"         

stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"         

stringr::str_sort(y, locale = "C")
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"   

# =======
R.version # new after upgrade
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          4                           
minor          0.0                         
year           2020                        
month          04                          
day            24                          
svn rev        78286                       
language       R                           
version.string R version 4.0.0 (2020-04-24)
nickname       Arbor Day

y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"   

stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"         

stringr::str_sort(y, locale = "C")
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"  

# ==== Test with new 4.0.2
R.version
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          4                           
minor          0.2                         
year           2020                        
month          06                          
day            22                          
svn rev        78730                       
language       R                           
version.string R version 4.0.2 (2020-06-22)
nickname       Taking Off Again 

y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"         

stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz"       "Seespital"    "SRZ"         

stringr::str_sort(y, locale = "C")
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital" 

asked Aug 06 '20 07:08 by Christoph


1 Answer

In summary, as @Roland figured out, it was a bug that was fixed in R version 4.0.1.
From the R NEWS on CRAN:

In R 4.0.0, sort.list(x) when is.object(x) was true, e.g., for x <- I(letters), was accidentally using method = "radix". Consequently, e.g., merge(<data.frame>) was much slower than previously; reported in PR#17794.
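If a test still has to run on machines where R 4.0.0 may be installed, that version can be detected explicitly. A minimal sketch; the helper name affected_by_sort_bug is hypothetical, and it only encodes what is stated here (behaviour changed in 4.0.0, fixed in 4.0.1):

```r
# Hypothetical helper: TRUE only for the one release reported as affected.
affected_by_sort_bug <- function(v = getRversion()) {
  v == "4.0.0"
}

affected_by_sort_bug(package_version("4.0.0"))  # TRUE
affected_by_sort_bug(package_version("4.0.2"))  # FALSE
affected_by_sort_bug()                          # checks the running session
```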

answered Oct 19 '22 23:10 by Christoph