do.call rbind of data.table depends on location of NA

Consider this

do.call(rbind, list(data.table(x=1, b='x'),data.table(x=1, b=NA)))

returns

   x  b
1: 1  x
2: 1 NA

but

do.call(rbind, list(data.table(x=1, b=NA),data.table(x=1, b='x')))

returns

   x  b
1: 1 NA
2: 1 NA

How can I force the first behavior without reordering the contents of the list?

data.table is many times faster than data.frame in my MapReduce jobs (I call data.table roughly 10*3MM times across 55 nodes), so I want this to work. Regards, Saptarshi

Sapsi asked Aug 27 '13 20:08


1 Answer

As noted by Frank, the problem is that there are (somewhat invisibly) several different types of NA. The one produced when you type NA at the command line is of class "logical", but there are also NA_integer_, NA_real_, NA_character_, and NA_complex_.
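
To make those types visible, here is a minimal base R sketch (the variable name na_types is only for illustration) that prints the class of each NA constant:

```r
# Each NA constant has its own type; a bare NA is logical.
na_types <- c(
  bare      = class(NA),
  integer   = class(NA_integer_),
  real      = class(NA_real_),
  character = class(NA_character_),
  complex   = class(NA_complex_)
)
print(na_types)
```

If the tables are built by code you control, writing b = NA_character_ instead of b = NA should give the column the character type from the start, which may be enough on its own.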

In your first example, the initial data.table sets the class of column b to "character", and the NA in the second data.table is then coerced to an NA_character_. In the second example, though, the NA in the first data.table sets column b's class to "logical", and, when the same column in the second data.table is coerced to "logical", it's converted to a logical NA. (Try as.logical("x") to see why.)
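
The two coercion directions can be sketched with plain vectors in base R. This only illustrates the type rules described above; it is not a trace of what rbind() does internally, and the variable names are made up for the example:

```r
# Character type established first: the logical NA is promoted to
# NA_character_, and the string survives.
char_first <- c("x", NA)

# Logical type established first: forcing the string into a logical
# vector loses it, because "x" has no logical interpretation.
forced_logical <- as.logical(c(NA, "x"))

print(class(char_first))      # "character"
print(class(forced_logical))  # "logical"
print(forced_logical)         # NA NA
```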

That's all fairly complicated (to articulate, at least), but there is a reasonably simple solution. Just create a 1-row template data.table, and prepend it to each list of data.tables you want to rbind(). It will set the class of each column to what you want, regardless of which data.tables follow it in the list passed to rbind(), and it can be trimmed off once everything else is bound together.

library(data.table)    

## The two lists of data.tables from the OP
A <- list(data.table(x=1, b='x'),data.table(x=1, b=NA))
B <- list(data.table(x=1, b=NA),data.table(x=1, b='x'))

## A 1-row template, used to set the column types (and then removed)
DT <- data.table(x=numeric(1), b=character(1))

## Test it out
do.call(rbind, c(list(DT), A))[-1,]
#    x  b
# 1: 1  x
# 2: 1 NA
do.call(rbind, c(list(DT), B))[-1,]
#    x  b
# 1: 1 NA
# 2: 1  x

## Finally, as _also_ noted by Frank, rbindlist will likely be more efficient
rbindlist(c(list(DT), B))[-1,]
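
If the template trick is needed in many places, it could be wrapped in a small function. The sketch below is only one way to do it: rbind_with_template is a hypothetical helper name, not part of data.table, and it assumes every table in the list has the same columns as the template.

```r
library(data.table)

# Hypothetical helper (not part of data.table): bind a list of tables
# behind a 1-row template that fixes the column types, then drop that row.
rbind_with_template <- function(tables, template) {
  stopifnot(nrow(template) == 1L)
  rbindlist(c(list(template), tables))[-1L]
}

# Same template and list B as above
template <- data.table(x = numeric(1), b = character(1))
B <- list(data.table(x = 1, b = NA), data.table(x = 1, b = "x"))

res <- rbind_with_template(B, template)
print(res)
```

Keeping the template as an explicit argument means the column types are stated once, next to the call, rather than inferred from whichever table happens to come first.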
Josh O'Brien answered Sep 29 '22 11:09