I just came across a slightly strange warning in my script:
# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion
Observation 1: Here's a reproducible example:
require(data.table)
DT.1 <- data.table(x = letters[1:5], y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)
# works fine
rbindlist(list(DT.1, DT.2))
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: A 11
# 7: B 12
# 8: C 13
# 9: D 14
# 10: E 15
However, if I now convert column x to a factor (ordered or not) and do the same:
DT.1[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: NA 11
# 7: NA 12
# 8: NA 13
# 9: NA 14
# 10: NA 15
# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion
But rbind does this job nicely!
rbind(DT.1, DT.2) # where DT.1 has column x as factor
# do.call(rbind, list(DT.1, DT.2)) # also works fine
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: A 11
# 7: B 12
# 8: C 13
# 9: D 14
# 10: E 15
The same behaviour can be reproduced if column x is an ordered factor as well. Since the help page ?rbindlist says "Same as do.call("rbind",l), but much faster.", I'm guessing this is not the desired behaviour?
Here's my session info:
# R version 3.0.0 (2013-04-03)
# Platform: x86_64-apple-darwin10.8.0 (64-bit)
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] data.table_1.8.8
#
# loaded via a namespace (and not attached):
# [1] tools_3.0.0
Observation 2: Following another interesting observation by @AnandaMahto, reversing the order:
# column x in DT.1 is still a factor
rbindlist(list(DT.2, DT.1))
# x y
# 1: A 11
# 2: B 12
# 3: C 13
# 4: D 14
# 5: E 15
# 6: 1 6
# 7: 2 7
# 8: 3 8
# 9: 4 9
# 10: 5 10
Here, the column from DT.1 is silently coerced to numeric.
Edit: Just to clarify, this is the same behaviour as that of rbind(DT.2, DT.1) with DT.1's column x being a factor. rbind seems to retain the class of the first argument. I'll leave this part here and mention that, in this case, this is the desired behaviour since rbindlist is a faster implementation of rbind.
Observation 3: If both columns are now converted to factors:
# DT.1 column x is already a factor
DT.2[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: a 11
# 7: b 12
# 8: c 13
# 9: d 14
# 10: e 15
Here, the column x from DT.2 is lost (replaced with that of DT.1). If the order is reversed, the exact opposite happens (column x of DT.1 gets replaced with that of DT.2).
In general, there seems to be a problem with handling factor columns in rbindlist.
I believe that rbindlist, when applied to factors, is combining the underlying integer codes of the factors and using only the levels associated with the first list element.
As in this bug report: http://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975
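To make that suspected mechanism concrete, here is a minimal base R sketch. It does not call rbindlist at all; it only imitates the assumed behaviour (concatenate the integer codes, keep the first factor's levels) and contrasts it with a level-aware combination. The names f1, f2, naive and safe are made up for this illustration.

```r
# Two factors with different level sets, standing in for DT.1$x and DT.2$x.
f1 <- factor(letters[1:5])
f2 <- factor(LETTERS[1:5])

# Assumed mechanism: concatenate the integer codes and reuse only f1's levels.
codes <- c(as.integer(f1), as.integer(f2))
naive <- factor(levels(f1)[codes], levels = levels(f1))
naive
# values: a b c d e a b c d e  (the labels of f2 are lost)

# Level-aware combination: go through character and use the union of levels.
safe <- factor(c(as.character(f1), as.character(f2)),
               levels = union(levels(f1), levels(f2)))
safe
# values: a b c d e A B C D E
```

The naive result matches the output shown in Observation 3, which is why this explanation seems plausible.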
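# Temporary workaround: give both factors the same, complete set of levels
# (unique() guards against duplicated levels, which factor() does not accept)
levs <- unique(c(as.character(DT.1$x), as.character(DT.2$x)))
DT.1[, x := factor(x, levels=levs)]
DT.2[, x := factor(x, levels=levs)]
rbindlist(list(DT.1, DT.2))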
Here is another view of what's going on:
DT3 <- data.table(x=c("1st", "2nd"), y=1:2)
DT4 <- copy(DT3)
DT3[, x := factor(x, levels=x)]
DT4[, x := factor(x, levels=x, labels=rev(x))]
DT3
DT4
# Have a look at the difference:
rbindlist(list(DT3, DT4))$x
# [1] 1st 2nd 1st 2nd
# Levels: 1st 2nd
do.call(rbind, list(DT3, DT4))$x
# [1] 1st 2nd 2nd 1st
# Levels: 1st 2nd
As for Observation 1, what's happening is similar to:
x <- factor(LETTERS[1:5])
x[6:10] <- letters[1:5]
x
# Notice however, if you are assigning a value that is already present
x[11] <- "S" # warning, since `S` is not one of the levels of x
x[12] <- "D" # all good, since `D` *is* one of the levels of x
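If the aim is to append values that are not yet among the levels, one option is to widen the level set before assigning. The following is a small base R sketch of that idea, not something rbindlist does for you; new_vals is a name introduced only for this example.

```r
x <- factor(LETTERS[1:5])
new_vals <- letters[1:5]

# Add any unseen values to the level set before assigning them.
levels(x) <- union(levels(x), new_vals)
x[6:10] <- new_vals   # no warning now, every value is a known level
x
any(is.na(x))
# [1] FALSE
```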
rbindlist is superfast because it doesn't do the checking that rbind.fill or do.call(rbind.data.frame, ...) do.
You can use a workaround like this to ensure that factors are coerced to characters.
DT.1 <- data.table(x = factor(letters[1:5]), y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)
# the list of tables to bind
DDL <- list(DT.1, DT.2)
# convert every factor column to character, by reference
for(ii in seq_along(DDL)){
  ff <- Filter(function(x) is.factor(DDL[[ii]][[x]]), names(DDL[[ii]]))
  for(fn in ff){
    set(DDL[[ii]], j = fn, value = as.character(DDL[[ii]][[fn]]))
  }
}
rbindlist(DDL)
Or, less memory-efficiently:
rbindlist(rapply(DDL, classes = 'factor', f= as.character, how = 'replace'))
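The same "coerce factors to character first" idea can be sketched on plain data.frames, so that it runs without data.table. This is only an illustration under that assumption: the helper factors_to_character and the objects DF.1, DF.2 and DFL are hypothetical names, and base rbind is used in place of rbindlist.

```r
# Hypothetical helper: turn every factor column of a data.frame into character.
factors_to_character <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  for (nm in names(df)[is_fac]) {
    df[[nm]] <- as.character(df[[nm]])
  }
  df
}

DF.1 <- data.frame(x = factor(letters[1:5]), y = 6:10)
DF.2 <- data.frame(x = LETTERS[1:5], y = 11:15, stringsAsFactors = FALSE)

# Coerce first, then bind; no factor levels are involved any more.
DFL <- lapply(list(DF.1, DF.2), factors_to_character)
combined <- do.call(rbind, DFL)
combined$x
```

Unlike the set() loop above, this copies each data.frame, so it trades memory for simplicity.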