Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combine two data frames by rows (rbind) when they have different sets of columns

Tags:

dataframe

r

r-faq

People also ask

How do I combine two data frames with different number of rows?

Use the full_join Function to Merge Two R Data Frames With Different Number of Rows. full_join is part of the dplyr package, and it can be used to merge two data frames with a different number of rows.

How do you Rbind two data frames?

To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. If data frameA has variables that data frameB does not, then either: Delete the extra variables in data frameA or.


rbind.fill from the package plyr might be what you are looking for.


A more recent solution is to use dplyr's bind_rows function which I assume is more efficient than smartbind.

df1 <- data.frame(a = c(1:5), b = c(6:10))
df2 <- data.frame(a = c(11:15), b = c(16:20), c = LETTERS[1:5])
dplyr::bind_rows(df1, df2)
    a  b    c
1   1  6 <NA>
2   2  7 <NA>
3   3  8 <NA>
4   4  9 <NA>
5   5 10 <NA>
6  11 16    A
7  12 17    B
8  13 18    C
9  14 19    D
10 15 20    E

Most of the base R answers address the situation where only one data.frame has additional columns or that the resulting data.frame would have the intersection of the columns. Since the OP writes I am hoping to retain the columns that do not match after the bind, an answer using base R methods to address this issue is probably worth posting.

Below, I present two base R methods: One that alters the original data.frames, and one that doesn't. Additionally, I offer a method that generalizes the non-destructive method to more than two data.frames.

First, let's get some sample data.

# sample data, variable c is in df1, variable d is in df2
df1 = data.frame(a=1:5, b=6:10, d=month.name[1:5])
df2 = data.frame(a=6:10, b=16:20, c = letters[8:12])

Two data.frames, alter originals
In order to retain all columns from both data.frames in an rbind (and allow the function to work without resulting in an error), you add NA columns to each data.frame with the appropriate missing names filled in using setdiff.

# fill in non-overlapping columns with NAs
df1[setdiff(names(df2), names(df1))] <- NA
df2[setdiff(names(df1), names(df2))] <- NA

Now, rbind-em

rbind(df1, df2)
    a  b        d    c
1   1  6  January <NA>
2   2  7 February <NA>
3   3  8    March <NA>
4   4  9    April <NA>
5   5 10      May <NA>
6   6 16     <NA>    h
7   7 17     <NA>    i
8   8 18     <NA>    j
9   9 19     <NA>    k
10 10 20     <NA>    l

Note that the first two lines alter the original data.frames, df1 and df2, adding the full set of columns to both.


Two data.frames, do not alter originals
To leave the original data.frames intact, first loop through the names that differ, return a named vector of NAs that are concatenated into a list with the data.frame using c. Then, data.frame converts the result into an appropriate data.frame for the rbind.

rbind(
  data.frame(c(df1, sapply(setdiff(names(df2), names(df1)), function(x) NA))),
  data.frame(c(df2, sapply(setdiff(names(df1), names(df2)), function(x) NA)))
)

Many data.frames, do not alter originals
In the instance that you have more than two data.frames, you could do the following.

# put data.frames into list (dfs named df1, df2, df3, etc)
mydflist <- mget(ls(pattern="df\\d+"))
# get all variable names
allNms <- unique(unlist(lapply(mydflist, names)))

# put em all together
do.call(rbind,
        lapply(mydflist,
               function(x) data.frame(c(x, sapply(setdiff(allNms, names(x)),
                                                  function(y) NA)))))

Maybe a bit nicer to not see the row names of original data.frames? Then do this.

do.call(rbind,
        c(lapply(mydflist,
                 function(x) data.frame(c(x, sapply(setdiff(allNms, names(x)),
                                                    function(y) NA)))),
          make.row.names=FALSE))

An alternative with data.table:

library(data.table)
df1 = data.frame(a = c(1:5), b = c(6:10))
df2 = data.frame(a = c(11:15), b = c(16:20), c = LETTERS[1:5])
rbindlist(list(df1, df2), fill = TRUE)

rbind will also work in data.table as long as the objects are converted to data.table objects, so

rbind(setDT(df1), setDT(df2), fill=TRUE)

will also work in this situation. This can be preferable when you have a couple of data.tables and don't want to construct a list.


You can use smartbind from the gtools package.

Example:

library(gtools)
df1 <- data.frame(a = c(1:5), b = c(6:10))
df2 <- data.frame(a = c(11:15), b = c(16:20), c = LETTERS[1:5])
smartbind(df1, df2)
# result
     a  b    c
1.1  1  6 <NA>
1.2  2  7 <NA>
1.3  3  8 <NA>
1.4  4  9 <NA>
1.5  5 10 <NA>
2.1 11 16    A
2.2 12 17    B
2.3 13 18    C
2.4 14 19    D
2.5 15 20    E

If the columns in df1 is a subset of those in df2 (by column names):

df3 <- rbind(df1, df2[, names(df1)])