When rbinding two data.tables with ordered factors, the ordering seems to be lost:
dtb1 = data.table(id = factor(c("a", "b"), levels = c("a", "c", "b"), ordered=T), key="id")
dtb2 = data.table(id = factor(c("c"), levels = c("a", "c", "b"), ordered=T), key="id")
test = rbind(dtb1, dtb2)
is.ordered(test$id)
#[1] FALSE
Any thoughts or ideas?
data.table does some fancy footwork so that data.table:::.rbind.data.table is called when rbind is called on objects that include data.tables. .rbind.data.table uses the speedups associated with rbindlist, with a bit of extra checking to match by name etc.
.rbind.data.table deals with factor columns by using c to combine them (hence retaining the levels attribute):
# the relevant code is
l = lapply(seq_along(allargs[[1L]]), function(i) do.call("c",
lapply(allargs, "[[", i)))
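To make that pattern concrete, here is a small base R sketch of the same column-wise combination applied to plain lists. The contents of `allargs` below are made up for illustration and stand in for the tables being bound.

```r
# Illustrative stand-in for the tables passed to rbind (made-up data)
allargs <- list(
  list(id = 1:2, x = c("a", "b")),
  list(id = 3L, x = "c")
)

# For each column position i, pull column i from every input and combine with c()
l <- lapply(seq_along(allargs[[1L]]), function(i) do.call("c",
  lapply(allargs, "[[", i)))

str(l)
```

Because every column goes through c, whatever c does to a factor decides what the bound column looks like.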
In base R (before version 4.1.0, which added its own c.factor method), using c in this manner does not retain the "ordered" attribute; it doesn't even return a factor!
For example (in base R):
f <- factor(1:2, levels = 2:1, ordered=TRUE)
g <- factor(1:2, levels = 2:1, ordered=TRUE)
# it isn't ordered!
is.ordered(c(f,g))
# [1] FALSE
# no surprise, as it isn't even a factor!
is.factor(c(f,g))
# [1] FALSE
However, data.table has an S3 method c.factor, which is used to ensure that a factor is returned and the levels are retained. Unfortunately, this method does not retain the ordered attribute.
getAnywhere('c.factor')
# A single object matching ‘c.factor’ was found
# It was found in the following places
# namespace:data.table
# with value
#
# function (...)
# {
# args <- list(...)
# for (i in seq_along(args)) if (!is.factor(args[[i]]))
# args[[i]] = as.factor(args[[i]])
# newlevels = unique(unlist(lapply(args, levels), recursive = TRUE,
# use.names = TRUE))
# ind <- fastorder(list(newlevels))
# newlevels <- newlevels[ind]
# nm <- names(unlist(args, recursive = TRUE, use.names = TRUE))
# ans = unlist(lapply(args, function(x) {
# m = match(levels(x), newlevels)
# m[as.integer(x)]
# }))
# structure(ans, levels = newlevels, names = nm, class = "factor")
# }
# <bytecode: 0x073f7f70>
# <environment: namespace:data.table>
So yes, this is a bug. It is now reported as #5019.
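On versions affected by this bug, one possible workaround is to rebuild the ordered factor after combining. The following is a minimal base R sketch, assuming the intended level order is known up front. `restore_order` is a hypothetical helper written for this example, not part of data.table.

```r
# Hypothetical helper (not part of data.table): re-impose a known level order
restore_order <- function(x, lvls) {
  factor(as.character(x), levels = lvls, ordered = TRUE)
}

lvls <- c("a", "c", "b")
f1 <- factor(c("a", "b"), levels = lvls, ordered = TRUE)
f2 <- factor("c", levels = lvls, ordered = TRUE)

# Combine the labels as plain character, then rebuild the ordered factor
combined <- restore_order(c(as.character(f1), as.character(f2)), lvls)
is.ordered(combined)
# [1] TRUE
```

With the tables from the question, the same idea could presumably be applied to the bound result with something like test[, id := restore_order(id, c("a", "c", "b"))].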
As of version 1.8.11, data.table combines ordered factors into an ordered factor if a global order exists; if no such order exists, it issues a warning and returns a plain factor:
DT1 = data.table(ordered('a', levels = c('a','b','c')))
DT2 = data.table(ordered('a', levels = c('a','d','b')))
rbind(DT1, DT2)$V1
#[1] a a
#Levels: a < d < b < c
DT3 = data.table(ordered('a', levels = c('b','a','c')))
rbind(DT1, DT3)$V1
#[1] a a
#Levels: a b c
#Warning message:
#In rbindlist(lapply(seq_along(allargs), function(x) { :
# ordered factor levels cannot be combined, going to convert to simple factor instead
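One way to think about whether a "global order exists" for two sets of levels is to check that the levels they share appear in the same relative order in both. The sketch below illustrates that idea only. `orders_compatible` is a hypothetical helper, and this is not a claim about how data.table implements the check internally.

```r
# Hypothetical helper: TRUE if the levels shared by a and b appear in the
# same relative order in both vectors (levels are assumed to be unique)
orders_compatible <- function(a, b) {
  identical(a[a %in% b], b[b %in% a])
}

orders_compatible(c("a", "b", "c"), c("a", "d", "b"))  # DT1 vs DT2: shared levels a, b agree
orders_compatible(c("a", "b", "c"), c("b", "a", "c"))  # DT1 vs DT3: a and b are swapped
```

The first call corresponds to the case above that stays ordered, and the second to the case that triggers the warning.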
To contrast, here's what base R does:
rbind(data.frame(DT1), data.frame(DT2))$V1
#[1] a a
#Levels: a < b < c < d
# Notice that the resulting order does not respect the suborder for DT2
rbind(data.frame(DT1), data.frame(DT3))$V1
#[1] a a
#Levels: a < b < c
# Again, suborders are not respected and new order is created