
rbindlist two data.tables where one has factor and other has character type for a column

Tags: r, data.table

I just came across a rather strange warning in my script:

# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion

Observation 1: Here's a reproducible example:

require(data.table)
DT.1 <- data.table(x = letters[1:5], y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)

# works fine
rbindlist(list(DT.1, DT.2))
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: A 11
#  7: B 12
#  8: C 13
#  9: D 14
# 10: E 15

However, now if I convert column x to a factor (ordered or not) and do the same:

DT.1[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
#      x  y
#  1:  a  6
#  2:  b  7
#  3:  c  8
#  4:  d  9
#  5:  e 10
#  6: NA 11
#  7: NA 12
#  8: NA 13
#  9: NA 14
# 10: NA 15
# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion

But rbind does this job nicely!

rbind(DT.1, DT.2) # where DT.1 has column x as factor
# do.call(rbind, list(DT.1, DT.2)) # also works fine
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: A 11
#  7: B 12
#  8: C 13
#  9: D 14
# 10: E 15

The same behaviour can be reproduced if column x is an ordered factor. Since the help page ?rbindlist says "Same as do.call("rbind",l), but much faster.", I'm guessing this is not the desired behaviour?


Here's my session info:

# R version 3.0.0 (2013-04-03)
# Platform: x86_64-apple-darwin10.8.0 (64-bit)
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.8.8
# 
# loaded via a namespace (and not attached):
# [1] tools_3.0.0

Edit:

Observation 2: Following another interesting observation by @AnandaMahto, reversing the order:

# column x in DT.1 is still a factor
rbindlist(list(DT.2, DT.1))
#     x  y
#  1: A 11
#  2: B 12
#  3: C 13
#  4: D 14
#  5: E 15
#  6: 1  6
#  7: 2  7
#  8: 3  8
#  9: 4  9
# 10: 5 10

Here, column x from DT.1 is silently coerced to numeric: the values 1 to 5 are the factor's underlying integer codes, not its labels.
Edit: Just to clarify, this is the same behaviour as that of rbind(DT.2, DT.1) with DT.1's column x being a factor. rbind seems to retain the class of the first argument. I'll leave this part here and mention that, in this case, this is the desired behaviour, since rbindlist is a faster implementation of rbind.
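
A minimal base R sketch of why 1 to 5 show up in the output of observation 2, assuming the factor's internal integer codes are what end up in the character column (the name f is only for illustration and is not part of the example above):

```r
# `f` is an illustrative name; it mirrors column x of DT.1 after factor().
f <- factor(letters[1:5])

levels(f)       # the labels of the factor
as.integer(f)   # the underlying integer codes

# Appending the raw codes to a character vector loses the labels:
from_codes <- c(LETTERS[1:5], as.integer(f))

# Converting to character first keeps them:
from_labels <- c(LETTERS[1:5], as.character(f))
```

Here from_codes ends in "1" to "5", like rows 6 to 10 above, while from_labels ends in "a" to "e".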

Observation 3: If both columns are now converted to factors:

# DT.1 column x is already a factor
DT.2[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: a 11
#  7: b 12
#  8: c 13
#  9: d 14
# 10: e 15

Here, the values of column x from DT.2 are lost (or rather, replaced with those of DT.1). If the order is reversed, the exact opposite happens (column x of DT.1 gets replaced with that of DT.2).

In general, there seems to be a problem with handling factor columns in rbindlist.

Arun, asked Apr 10 '13 18:04


2 Answers

UPDATE - This bug (#2650) was fixed on 17 May 2013 in v1.8.9


I believe that rbindlist, when applied to factors, combines the underlying integer codes of the factors and uses only the levels associated with the first list element.

As in this bug report: http://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975
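
To make that hypothesis concrete, here is a rough base R emulation. It is only a sketch of the suspected behaviour, not the actual rbindlist source, and the names f1, f2, codes, naive and correct are made up for illustration:

```r
# f1 and f2 stand in for column x of DT.1 and DT.2 once both are factors.
f1 <- factor(letters[1:5])
f2 <- factor(LETTERS[1:5])

# Suspected behaviour: concatenate the integer codes ...
codes <- c(as.integer(f1), as.integer(f2))

# ... and interpret all of them with the levels of the first factor only.
naive <- factor(levels(f1)[codes], levels = levels(f1))
as.character(naive)    # the labels of f2 are gone

# What one would expect instead: combine the labels, then build the factor.
correct <- factor(c(as.character(f1), as.character(f2)))
as.character(correct)
```

The naive result repeats a to e twice, which matches observation 3 in the question.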


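# Temporary workaround: give both columns the same set of levels.
# unique() is needed because a factor cannot have duplicated levels.

levs <- unique(c(as.character(DT.1$x), as.character(DT.2$x)))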

DT.1[, x := factor(x, levels=levs)]
DT.2[, x := factor(x, levels=levs)]

rbindlist(list(DT.1, DT.2))

As another view of what's going on:

DT3 <- data.table(x=c("1st", "2nd"), y=1:2)
DT4 <- copy(DT3)

DT3[, x := factor(x, levels=x)]
DT4[, x := factor(x, levels=x, labels=rev(x))]

DT3
DT4

# Have a look at the difference:
rbindlist(list(DT3, DT4))$x
# [1] 1st 2nd 1st 2nd
# Levels: 1st 2nd

do.call(rbind, list(DT3, DT4))$x
# [1] 1st 2nd 2nd 1st
# Levels: 1st 2nd

Edit as per comments:

As for observation 1, what's happening is similar to:

x <- factor(LETTERS[1:5])

x[6:10] <- letters[1:5]
x

# Notice however, if you are assigning a value that is already present
x[11] <- "S"  # warning, since `S` is not one of the levels of x
x[12] <- "D"  # all good, since `D` *is* one of the levels of x
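
For comparison, a small base R sketch of how the same assignment goes through without NAs once the new values have been added as levels first (this only illustrates the factor mechanics; it is not what rbindlist does internally):

```r
x <- factor(LETTERS[1:5])

# Extend the set of levels before assigning the new values.
levels(x) <- c(levels(x), letters[1:5])

x[6:10] <- letters[1:5]   # no warning: every value is now a known level
any(is.na(x))
as.character(x)
```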
Ricardo Saporta, answered Oct 20 '22 18:10


rbindlist is superfast because it doesn't do the checking that rbind.fill (from plyr) or do.call(rbind.data.frame, ...) do.

You can use a workaround like this to ensure that factors are coerced to characters.

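DT.1 <- data.table(x = factor(letters[1:5]), y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)

# collect the tables in a list (set() below modifies columns by reference)
DDL <- list(DT.1, DT.2)

for (ii in seq_along(DDL)) {
  # names of the factor columns in the ii-th table
  ff <- Filter(function(x) is.factor(DDL[[ii]][[x]]), names(DDL[[ii]]))
  for (fn in ff) {
    set(DDL[[ii]], j = fn, value = as.character(DDL[[ii]][[fn]]))
  }
}
rbindlist(DDL)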

or (less memory efficiently)

rbindlist(rapply(DDL, classes = 'factor', f= as.character, how = 'replace'))
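
To see what the rapply call does on its own, here is a small base R sketch on a plain nested list. The names nested and converted are made up for illustration, and plain lists stand in for the data.tables in DDL:

```r
# Each inner list plays the role of one table: a factor or character
# column x and an integer column y.
nested <- list(
  first  = list(x = factor(c("a", "b")), y = 1:2),
  second = list(x = c("A", "B"), y = 3:4)
)

# Apply as.character only to elements of class "factor"; with
# how = "replace" everything else is returned unchanged.
converted <- rapply(nested, f = as.character, classes = "factor", how = "replace")

class(converted$first$x)
class(converted$first$y)
```

Only the factor element changes class; the integer and character elements come back as they were.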
mnel, answered Oct 20 '22 18:10