Short version. I load()
data in a package. Previously, a test in a package passed, now it fails because the output of sort
changed.
Here is a minimal reproducible example - for details see below:
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# OLD 3.5.2 [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# NEW 4.0.0 [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# Update 4.0.2 see comment:
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# From jay.sf's comment
sort.int(y, method="radix")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
sort.int(y, method="shell")
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# From Henrik's comment:
data.table::fsort(y)
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
The only related reported change I found is
CHANGES IN R 4.0.0
NEW FEATURES
...
When loading data sets via read.table(), data() now uses LC_COLLATE=C to ensure locale-independent results for possible string-to-factor conversions.
But I am even not sure, if this could explain what I see.
As I want to minimize the number of imported packages and I would like to understand what's going on, I am not sure how to proceed. Do I miss something?
(A change to a sort.int
with method radix
would do the job, but still: Why did it change? Is that really better?
I just realized, that (thanks to Roland) sort
calls in my case sort.int
:
function (x, decreasing = FALSE, na.last = NA, ...)
{
if (is.object(x))
x[order(x, na.last = na.last, decreasing = decreasing)]
else sort.int(x, na.last = na.last, decreasing = decreasing,
...)
}
From ?sort.int
:
The "auto" method selects "radix" for short (less than 2^31 elements) numeric vectors, integer vectors, logical vectors and factors; otherwise, "shell".)
And according to the docs, sort.int
did not change from 4.0.0 to 4.0.2.
From ?data.table::setorder
data.table always reorders in "C-locale". As a consequence, the ordering may be different to that obtained by base::order. In English locales, for example, sorting is case-sensitive in C-locale. Thus, sorting c("c", "a", "B") returns c("B", "a", "c") in data.table but c("a", "B", "c") in base::order. Note this makes no difference in most cases of data; both return identical results on ids where only upper-case or lower-case letters are present ("AB123" < "AC234" is true in both), or on country names and other proper nouns which are consistently capitalized. For example, neither "America" < "Brazil" nor "america" < "brazil" are affected since the first letter is consistently capitalized.
Using C-locale makes the behaviour of sorting in data.table more consistent across sessions and locales. The behaviour of base::order depends on assumptions about the locale of the R session. In English locales, "america" < "BRAZIL" is true by default but false if you either type Sys.setlocale(locale="C") or the R session has been started in a C locale for you – which can happen on servers/services since the locale comes from the environment the R session was started in. By contrast, "america" < "BRAZIL" is always FALSE in data.table regardless of the way your R session was started.
(Related questions Language dependent sorting with R and Best practice: Should I try to change to UTF-8 as locale or is it safe to leave it as is?)
Details
R.version # old _
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 5.2
year 2018
month 12
day 20
svn rev 75870
language R
version.string R version 3.5.2 (2018-12-20)
nickname Eggshell Igloo
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# =======
R.version # new after upgrade
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.0
year 2020
month 04
day 24
svn rev 78286
language R
version.string R version 4.0.0 (2020-04-24)
nickname Arbor Day
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
#[1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# ==== Test with new 4.0.2
R.version
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.2
year 2020
month 06
day 22
svn rev 78730
language R
version.string R version 4.0.2 (2020-06-22)
nickname Taking Off Again
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
However, if it is a really huge dataset, it could take longer to load it later because R first has to extract the file again. So, if you want to save space, then leave it as it is. If you want to save time, add a parameter . Then, the object is available in your workspace with its old name.
Whenever you are not so who will work with the data later on and whether these people are all using R, you might want to export your dataset as a CSV file. Also, it’s human readable.
I understand now why the sort order is different after loading it. It is because the sort order is determined by the engine to optimize memory storage in the model, not to preserve any "original" sorting of the table itself.
I will show you the following ways of saving or exporting your data from R: 1 Saving it as an R object with the functions save() save () and saveRDS() saveRDS () 2 Saving it as a CSV file with write.table() write.table () or fwrite() fwrite () 3 Exporting it to an Excel file with WriteXLS() WriteXLS ()
In summary, it was a bug which has been removed in R version 4.0.1. As @Roland figured out.
From CRAN:
In R 4.0.0,
sort.list(x)
whenis.object(x)
was true, e.g., forx <-I(letters)
, was accidentallyusingmethod = "radix"
. Consequently, e.g.,merge(<data.frame>)
was much slower than previously; reported in PR#17794.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With