Rbind with new columns and data.table

Tags:

r

data.table

I need to add many large tables to an existing table, so I use rbind with the excellent package data.table. But some of the later tables have more columns than the original one (which need to be included). Is there an equivalent of rbind.fill for data.table?

library(data.table)

aa <- c(1,2,3)
bb <- c(2,3,4)
cc <- c(3,4,5)

dt.1 <- data.table(cbind(aa, bb))
dt.2 <- data.table(cbind(aa, bb, cc))

dt.11 <- rbind(dt.1, dt.1)  # Works, but not what I need
dt.12 <- rbind(dt.1, dt.2)  # What I need, doesn't work
dt.12 <- rbind.fill(dt.1, dt.2)  # What I need, doesn't work either

I need to start rbinding before I have all tables, so no way to know what future new columns will be called. Missing data can be filled with NA.


asked Feb 21 '13 23:02

Chris

5 Answers

Since v1.9.2, data.table's rbind function has had a fill argument. From the ?rbind.data.table documentation:

If TRUE fills missing columns with NAs. By default FALSE. When TRUE, use.names has to be TRUE, and all items of the input list has to have non-null column names.

Thus you can do (prior to approx v1.9.6):

data.table::rbind(dt.1, dt.2, fill=TRUE) 
#    aa bb cc
# 1:  1  2 NA
# 2:  2  3 NA
# 3:  3  4 NA
# 4:  1  2  3
# 5:  2  3  4
# 6:  3  4  5

UPDATE for v1.9.6:

This now works directly:

rbind(dt.1, dt.2, fill=TRUE)
#    aa bb cc
# 1:  1  2 NA
# 2:  2  3 NA
# 3:  3  4 NA
# 4:  1  2  3
# 5:  2  3  4
# 6:  3  4  5
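Since the question mentions binding tables as they arrive, here is a minimal sketch of how fill=TRUE might be used incrementally, and all at once via rbindlist() when the tables are already in a list. The table dt.3 and its columns are hypothetical and only serve to illustrate a later table with an unseen column.

```r
library(data.table)

dt.1 <- data.table(aa = c(1, 2, 3), bb = c(2, 3, 4))
dt.2 <- data.table(aa = c(1, 2, 3), bb = c(2, 3, 4), cc = c(3, 4, 5))
# Hypothetical later table: lacks aa and cc, introduces a new column dd
dt.3 <- data.table(bb = c(9, 9), dd = c("x", "y"))

# Incremental: bind each new table onto the running result as it arrives
master <- dt.1
for (new.table in list(dt.2, dt.3)) {
  master <- rbind(master, new.table, fill = TRUE)
}

# All at once: when every table is already available in a list
all.at.once <- rbindlist(list(dt.1, dt.2, dt.3), fill = TRUE)

names(master)
```

Growing a table inside a loop copies the accumulated rows on every iteration, so when the tables can be collected first, a single rbindlist() call is likely the cheaper option.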


answered Oct 05 '22 22:10

Daniel Krizian

Here is an approach that adds the missing columns to each table, as NA of the matching class, before binding:

rbind.missing <- function(A, B) { 

  cols.A <- names(A)
  cols.B <- names(B)

  missing.A <- setdiff(cols.B,cols.A)
  # check and define missing columns in A
  if(length(missing.A) > 0L){
   # .. means "look up one level"
   class.missing.A <- lapply(B[, ..missing.A], class)
   nas.A <- lapply(class.missing.A, as, object = NA)
   A[,c(missing.A) := nas.A]
  }
  # check and define missing columns in B
  missing.B <- setdiff(names(A), cols.B)
  if(length(missing.B) > 0L){
    class.missing.B <- lapply(A[, ..missing.B], class)
    nas.B <- lapply(class.missing.B, as, object = NA)
    B[,c(missing.B) := nas.B]
  }
  # reorder so they are the same
  setcolorder(B, names(A))
  rbind(A, B)

}

rbind.missing(dt.1,dt.2)

##    aa bb cc
## 1:  1  2 NA
## 2:  2  3 NA
## 3:  3  4 NA
## 4:  1  2  3
## 5:  2  3  4
## 6:  3  4  5

This will not be efficient for many or large data.tables, as it only works on two tables at a time.
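One point that may be worth keeping in mind: `:=` adds columns by reference, so a function that uses it on its arguments can also change the tables held by the caller. The sketch below uses a hypothetical helper, add.cc, that stands in for that pattern, and shows how passing copy() of a table leaves the original untouched.

```r
library(data.table)

dt.1 <- data.table(aa = c(1, 2, 3), bb = c(2, 3, 4))

# Hypothetical helper: adds a typed NA column by reference, as `:=` does above
add.cc <- function(dt) {
  dt[, cc := NA_real_]
  invisible(dt)
}

# Work on a copy so that dt.1 itself is not modified
safe <- copy(dt.1)
add.cc(safe)

names(dt.1)  # the original keeps only its own columns
names(safe)  # the copy has gained cc
```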

answered Oct 06 '22 00:10

mnel

The answers are great, but other functions that have been suggested, such as plyr::rbind.fill and gtools::smartbind, also worked perfectly for me.

answered Oct 06 '22 00:10

KarthikS

The basic concept is to add missing columns in both directions: from the running master table to the newTable, and back the other way.

As @mnel pointed out in the comments, simply assigning NA is a problem, because that makes the whole column of class logical.

One solution is to force all columns to a single type (i.e. as.numeric(NA)), but that is too restrictive.

Instead, we need to determine the class of each new column. We can then use as(NA, cc) (cc being the class) as the vector that we assign to the new column. We wrap this in an lapply statement on the RHS and use eval(columnName) on the LHS to assign.
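To illustrate the class problem, here is a small sketch with made-up column names (plain and typed are only for illustration): a bare NA creates a logical column, while as(NA, "character") creates a character column of NAs.

```r
library(data.table)

dt <- data.table(a = 1:3)

# A bare NA is logical, so the new column is logical
dt[, plain := NA]

# Coercing NA to the target class first gives a column of that class
dt[, typed := as(NA, "character")]

col.classes <- sapply(dt, class)
col.classes
```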

We can then wrap this in a function and use S3 methods so that we can simply call

rbindFill(A, B)

Below is the function.

rbindFill.data.table <- function(master, newTable)  {
# Append newTable to master

    # assign to Master
    #-----------------#
      # identify columns missing
      colMisng     <- setdiff(names(newTable), names(master))

      # if there are no columns missing, move on to next part
      if (!identical(colMisng, character(0)))  {
           # identify class of each
            colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))

            # assign to each column value of NA with appropriate class 
            master[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
          }

    # assign to newTable
    #-----------------#
      # identify columns missing
      colMisng     <- setdiff(names(master), names(newTable))

      # if there are no columns missing, move on to next part
      if (!identical(colMisng, character(0)))  {
        # identify class of each
        colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))

        # assign to each column value of NA with appropriate class 
        newTable[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
      }

    # reorder columns to avoid warning about ordering
    #-----------------#
      # both tables now have the same set of columns, so master's order can be used directly
      setcolorder(newTable, names(master))

    # rbind them! 
    #-----------------#
      rbind(master, newTable)
  }

  # implement generic function
  rbindFill <- function(x, y, ...) UseMethod("rbindFill")

Example Usage:

    # Sample Data: 
    #--------------------------------------------------#
    A  <- data.table(a=1:3, b=1:3, c=1:3)
    A2 <- data.table(a=6:9, b=6:9, c=6:9)
    B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
    C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
    #--------------------------------------------------#

    # Three iterations of calling rbindFill
    master <- rbindFill(A, B)
    master <- rbindFill(master, A2)
    master <- rbindFill(master, C)

    # Results:
    master
    #      a  b c  d  m     n     f
    #  1:  1  1 1 NA NA    NA    NA
    #  2:  2  2 2 NA NA    NA    NA
    #  3:  3  3 3 NA NA    NA    NA
    #  4: NA  1 1  1  A    NA    NA
    #  5: NA  2 2  2  B    NA    NA
    #  6: NA  3 3  3  C    NA    NA
    #  7:  6  6 6 NA NA    NA    NA
    #  8:  7  7 7 NA NA    NA    NA
    #  9:  8  8 8 NA NA    NA    NA
    # 10:  9  9 9 NA NA    NA    NA
    # 11: NA NA 7 NA NA  0.86  TRUE
    # 12: NA NA 8 NA NA -1.15 FALSE
    # 13: NA NA 9 NA NA  1.10  TRUE

answered Oct 05 '22 22:10

Ricardo Saporta

Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0], which has the structure of the second data.table. This avoids the possibility of introducing bugs in user functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.

Also, using rbindlist() should be much faster when dealing with data.tables.

Define the tables (same as mnel's code above):

library(data.table)
A  <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)

Insert the missing variables in table A (note the use of A2[0]):

A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)

Insert the missing columns in table A2:

A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)

Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (possibly not needed, not sure if rbindlist() binds across column names or column positions):

setcolorder(A2, names(A))
DT.ALL <- rbindlist(l=list(A,A2))
DT.ALL
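On the question of names versus positions: rbindlist() has a use.names argument, and passing use.names=TRUE asks it to match columns by name. A minimal sketch, using two small made-up tables x and y whose columns are in different orders:

```r
library(data.table)

# Same column names, different column order
x <- data.table(a = 1:2, b = 3:4)
y <- data.table(b = 5:6, a = 7:8)

# use.names = TRUE matches columns by name rather than by position
by.name <- rbindlist(list(x, y), use.names = TRUE)

by.name$a  # 1 2 7 8
by.name$b  # 3 4 5 6
```

With use.names=TRUE the explicit setcolorder() step should no longer be required, although keeping it does no harm.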

Repeat for the other tables... Maybe it would be better to put this into a function rather than repeat by hand...

DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
setcolorder(B, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, B))

DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
setcolorder(C, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, C))
DT.ALL

The result looks the same as mnel's output (except for the random numbers and the column order).

PS1: The original author does not say what to do if there are matching variables -- do we really want to do an rbind() or are we thinking of a merge()?
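The repeated merge steps could be wrapped in a function along the following lines. This is only a sketch: the name bind.via.merge is made up for illustration, it assumes the two tables share at least one column (merge() needs a non-empty by), and, because merge() sorts on the by columns, the row order within each table may change.

```r
library(data.table)

# Hypothetical helper wrapping the merge-with-empty-table steps
bind.via.merge <- function(x, y) {
  common <- intersect(names(x), names(y))
  # Assumption: at least one shared column, otherwise merge() has nothing to join on
  stopifnot(length(common) > 0L)

  # Add y's extra columns (typed NAs) to x by merging with an empty copy of y
  x2 <- merge(x = x, y = y[0], by = common, all = TRUE)
  # Add x's extra columns to y; x2 now contains every column of y
  y2 <- merge(x = x2[0], y = y, by = intersect(names(x2), names(y)), all = TRUE)

  setcolorder(y2, names(x2))
  rbindlist(list(x2, y2))
}

A <- data.table(a = 1:3, b = 1:3, c = 1:3)
B <- data.table(b = 1:3, c = 1:3, d = 1:3, m = LETTERS[1:3])

DT.AB <- bind.via.merge(A, B)
DT.AB
```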

PS2: (Since I do not have enough reputation to comment) The gist of the question seems a duplicate of this question. Also important for the benchmarking of data.table vs. plyr with large datasets.

answered Oct 05 '22 22:10

Peter
