Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rbind with new columns and data.table

Tags:

r

data.table

I need to add many large tables to an existing table, so I use rbind with the excellent package data.table. But some of the later tables have more columns than the original one (which need to be included). Is there an equivalent of rbind.fill for data.table?

library(data.table)

aa <- c(1,2,3)
bb <- c(2,3,4)
cc <- c(3,4,5)

dt.1 <- data.table(cbind(aa, bb))
dt.2 <- data.table(cbind(aa, bb, cc))

dt.11 <- rbind(dt.1, dt.1)  # Works, but not what I need
dt.12 <- rbind(dt.1, dt.2)  # What I need, doesn't work
dt.12 <- rbind.fill(dt.1, dt.2)  # What I need, doesn't work either

I need to start rbinding before I have all tables, so no way to know what future new columns will be called. Missing data can be filled with NA.

like image 523
Chris Avatar asked Feb 21 '13 23:02

Chris


People also ask

What does rbind() do in R?

rbind() function in R Language is used to combine specified Vector, Matrix or Data Frame by rows. deparse. level: This value determines how the column names generated. The default value of deparse.

When to use rbind?

The name of the rbind R function stands for row-bind. The rbind function can be used to combine several vectors, matrices and/or data frames by rows.

What is faster than Rbind?

table is the fastest with average execution time 428 milliseconds. It's more than twice faster than bind_rows from dplyr , which took an average of 1,050 milliseconds, and more than 10 times faster than rbind from base R, which took an average of 5,358 milliseconds!


5 Answers

Since v1.9.2, data.table's rbind function gained fill argument. From ?rbind.data.table documentation:

If TRUE fills missing columns with NAs. By default FALSE. When TRUE, use.names has to be TRUE, and all items of the input list has to have non-null column names.

Thus you can do (prior to approx v1.9.6):

data.table::rbind(dt.1, dt.2, fill=TRUE) 
#    aa bb cc
# 1:  1  2 NA
# 2:  2  3 NA
# 3:  3  4 NA
# 4:  1  2  3
# 5:  2  3  4
# 6:  3  4  5

UPDATE for v1.9.6:

This now works directly:

rbind(dt.1, dt.2, fill=TRUE)
#    aa bb cc
# 1:  1  2 NA
# 2:  2  3 NA
# 3:  3  4 NA
# 4:  1  2  3
# 5:  2  3  4
# 6:  3  4  5
like image 130
Daniel Krizian Avatar answered Oct 05 '22 22:10

Daniel Krizian


Here is an approach that will update the missing columns in

rbind.missing <- function(A, B) { 

  cols.A <- names(A)
  cols.B <- names(B)

  missing.A <- setdiff(cols.B,cols.A)
  # check and define missing columns in A
  if(length(missing.A) > 0L){
   # .. means "look up one level"
   class.missing.A <- lapply(B[, ..missing.A], class)
   nas.A <- lapply(class.missing.A, as, object = NA)
   A[,c(missing.A) := nas.A]
  }
  # check and define missing columns in B
  missing.B <- setdiff(names(A), cols.B)
  if(length(missing.B) > 0L){
    class.missing.B <- lapply(A[, ..missing.B], class)
    nas.B <- lapply(class.missing.B, as, object = NA)
    B[,c(missing.B) := nas.B]
  }
  # reorder so they are the same
  setcolorder(B, names(A))
  rbind(A, B)

}

rbind.missing(dt.1,dt.2)

##    aa bb cc
## 1:  1  2 NA
## 2:  2  3 NA
## 3:  3  4 NA
## 4:  1  2  3
## 5:  2  3  4
## 6:  3  4  5

This will not be efficient for many, or large data.tables, as it only works two at a time.

like image 34
mnel Avatar answered Oct 06 '22 00:10

mnel


The answers are awesome, but looks like, there are some functions suggested here such as plyr::rbind.fill and gtools::smartbind which seemed to work perfectly for me.

like image 20
KarthikS Avatar answered Oct 06 '22 00:10

KarthikS


the basic concept is to add missing columns in both directions: from the running master table to the newTable and back the other way.

As @menl pointed out in the comments, simply assigning an NA is a problem, because that will make the whole column of class logical.

One solution is to force all columns of a single type (ie as.numeric(NA)), but that is too restrictive.

Instead, we need to analyze each new column for its class. We can then use as(NA, cc) _(cc being the class) as the vector that we will assign to a new column. We wrap this in an lapply statement on the RHS and use eval(columnName) on the LHS to assign.

We can then wrap this in a function and use S3 methods so that we can simply call

rbindFill(A, B)

Below is the function.

rbindFill.data.table <- function(master, newTable)  {
# Append newTable to master

    # assign to Master
    #-----------------#
      # identify columns missing
      colMisng     <- setdiff(names(newTable), names(master))

      # if there are no columns missing, move on to next part
      if (!identical(colMisng, character(0)))  {
           # identify class of each
            colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))

            # assign to each column value of NA with appropriate class 
            master[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
          }

    # assign to newTable
    #-----------------#
      # identify columns missing
      colMisng     <- setdiff(names(master), names(newTable))

      # if there are no columns missing, move on to next part
      if (!identical(colMisng, character(0)))  {
        # identify class of each
        colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))

        # assign to each column value of NA with appropriate class 
        newTable[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
      }

    # reorder columns to avoid warning about ordering
    #-----------------#
      colOrdering <- colOrderingByOtherCol(newTable, names(master))
      setcolorder(newTable,  colOrdering)

    # rbind them! 
    #-----------------#
      rbind(master, newTable)
  }

  # implement generic function
  rbindFill <- function(x, y, ...) UseMethod("rbindFill")


Example Usage:

    # Sample Data: 
    #--------------------------------------------------#
    A  <- data.table(a=1:3, b=1:3, c=1:3)
    A2 <- data.table(a=6:9, b=6:9, c=6:9)
    B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
    C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
    #--------------------------------------------------#

    # Four iterations of calling rbindFill
    master <- rbindFill(A, B)
    master <- rbindFill(master, A2)
    master <- rbindFill(master, C)

    # Results:
    master
    #      a  b c  d  m     n     f
    #  1:  1  1 1 NA NA    NA    NA
    #  2:  2  2 2 NA NA    NA    NA
    #  3:  3  3 3 NA NA    NA    NA
    #  4: NA  1 1  1  A    NA    NA
    #  5: NA  2 2  2  B    NA    NA
    #  6: NA  3 3  3  C    NA    NA
    #  7:  6  6 6 NA NA    NA    NA
    #  8:  7  7 7 NA NA    NA    NA
    #  9:  8  8 8 NA NA    NA    NA
    # 10:  9  9 9 NA NA    NA    NA
    # 11: NA NA 7 NA NA  0.86  TRUE
    # 12: NA NA 8 NA NA -1.15 FALSE
    # 13: NA NA 9 NA NA  1.10  TRUE
like image 36
Ricardo Saporta Avatar answered Oct 05 '22 22:10

Ricardo Saporta


Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0] which has the structure of the second data.table. This saves the possibility to introduce bugs in user functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.

Also, using rbindlist() should be much faster when dealing with data.tables.

Define the tables (same as mnel's code above):

library(data.table)
A  <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)

Insert the missing variables in table A: (note the use of A2[0]

A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)

Insert the missing columns in table A2:

A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)

Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (possibly not needed, not sure if rbindlist() binds across column names or column positions):

setcolorder(A2, names(A))
DT.ALL <- rbindlist(l=list(A,A2))
DT.ALL

Repeat for the other tables... Maybe it would be better to put this into a function rather than repeat by hand...

DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
setcolorder(B, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, B))

DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
setcolorder(C, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, C))
DT.ALL

The result looks the same as mnels' output (except for the random numbers and the column order).

PS1: The original author does not say what to do if there are matching variables -- do we really want to do a rbind() or are we thinking of a merge()?

PS2: (Since I do not have enough reputation to comment) The gist of the question seems a duplicate of this question. Also important for the benchmarking of data.table vs. plyr with large datasets.

like image 26
Peter Avatar answered Oct 05 '22 22:10

Peter