Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Simultaneously merge multiple data.frames in a list

People also ask

Can you merge multiple Dataframes at once?

Pandas merge() function is used to merge multiple Dataframes. We can use either pandas. merge() or DataFrame. merge() to merge multiple Dataframes.

How do I merge Dataframes in a list in R?

If we want to merge more than two dataframes we can use cbind() function and pass the resultant cbind() variable into as. list() function to convert it into list .

How do I merge a list of Dataframes?

To join a list of DataFrames, say dfs , use the pandas. concat(dfs) function that merges an arbitrary number of DataFrames to a single one.

How do I merge two data frames in the same column in R?

The merge() function in base R can be used to merge input dataframes by common columns or row names. The merge() function retains all the row names of the dataframes, behaving similarly to the inner join. The dataframes are combined in order of the appearance in the input function call.


Another question asked specifically how to perform multiple left joins using dplyr in R . The question was marked as a duplicate of this one so I answer here, using the 3 sample data frames below:

x <- data.frame(i = c("a","b","c"), j = 1:3, stringsAsFactors=FALSE)
y <- data.frame(i = c("b","c","d"), k = 4:6, stringsAsFactors=FALSE)
z <- data.frame(i = c("c","d","a"), l = 7:9, stringsAsFactors=FALSE)

Update June 2018: I divided the answer in three sections representing three different ways to perform the merge. You probably want to use the purrr way if you are already using the tidyverse packages. For comparison purposes below, you'll find a base R version using the same sample dataset.


1) Join them with reduce from the purrr package:

The purrr package provides a reduce function which has a concise syntax:

library(tidyverse)
list(x, y, z) %>% reduce(left_join, by = "i")
#  A tibble: 3 x 4
#  i       j     k     l
#  <chr> <int> <int> <int>
# 1 a      1    NA     9
# 2 b      2     4    NA
# 3 c      3     5     7

You can also perform other joins, such as a full_join or inner_join:

list(x, y, z) %>% reduce(full_join, by = "i")
# A tibble: 4 x 4
# i       j     k     l
# <chr> <int> <int> <int>
# 1 a     1     NA     9
# 2 b     2     4      NA
# 3 c     3     5      7
# 4 d     NA    6      8

list(x, y, z) %>% reduce(inner_join, by = "i")
# A tibble: 1 x 4
# i       j     k     l
# <chr> <int> <int> <int>
# 1 c     3     5     7

2) dplyr::left_join() with base R Reduce():

list(x,y,z) %>%
    Reduce(function(dtf1,dtf2) left_join(dtf1,dtf2,by="i"), .)

#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7

3) Base R merge() with base R Reduce():

And for comparison purposes, here is a base R version of the left join based on Charles's answer.

 Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "i", all.x = TRUE),
        list(x,y,z))
#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7

Reduce makes this fairly easy:

merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)

Here's a fully example using some mock data:

set.seed(1)
list.of.data.frames = list(data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)))
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
tail(merged.data.frame)
#    x  a  b         y
#12 12 NA 18        NA
#13 13 NA 19        NA
#14 14 NA 20 0.4976992
#15 15 NA NA 0.7176185
#16 16 NA NA 0.3841037
#17 19 NA NA 0.3800352

And here's an example using these data to replicate my.list:

merged.data.frame = Reduce(function(...) merge(..., by=match.by, all=T), my.list)
merged.data.frame[, 1:12]

#  matchname party st district chamber senate1993 name.x v2.x v3.x v4.x senate1994 name.y
#1   ALGIERE   200 RI      026       S         NA   <NA>   NA   NA   NA         NA   <NA>
#2     ALVES   100 RI      019       S         NA   <NA>   NA   NA   NA         NA   <NA>
#3    BADEAU   100 RI      032       S         NA   <NA>   NA   NA   NA         NA   <NA>

Note: It looks like this is arguably a bug in merge. The problem is there is no check that adding the suffixes (to handle overlapping non-matching names) actually makes them unique. At a certain point it uses [.data.frame which does make.unique the names, causing the rbind to fail.

# first merge will end up with 'name.x' & 'name.y'
merge(my.list[[1]], my.list[[2]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y
#<0 rows> (or 0-length row.names)
# as there is no clash, we retain 'name.x' & 'name.y' and get 'name' again
merge(merge(my.list[[1]], my.list[[2]], by=match.by, all=T), my.list[[3]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y senate1995   name         votes.year  
#<0 rows> (or 0-length row.names)
# the next merge will fail as 'name' will get renamed to a pre-existing field.

Easiest way to fix is to not leave the field renaming for duplicates fields (of which there are many here) up to merge. Eg:

my.list2 = Map(function(x, i) setNames(x, ifelse(names(x) %in% match.by,
      names(x), sprintf('%s.%d', names(x), i))), my.list, seq_along(my.list))

The merge/Reduce will then work fine.


You can do it using merge_all in the reshape package. You can pass parameters to merge using the ... argument

reshape::merge_all(list_of_dataframes, ...)

Here is an excellent resource on different methods to merge data frames.


You can use recursion to do this. I haven't verified the following, but it should give you the right idea:

MergeListOfDf = function( data , ... )
{
    if ( length( data ) == 2 ) 
    {
        return( merge( data[[ 1 ]] , data[[ 2 ]] , ... ) )
    }    
    return( merge( MergeListOfDf( data[ -1 ] , ... ) , data[[ 1 ]] , ... ) )
}

I will reuse the data example from @PaulRougieux

x <- data_frame(i = c("a","b","c"), j = 1:3)
y <- data_frame(i = c("b","c","d"), k = 4:6)
z <- data_frame(i = c("c","d","a"), l = 7:9)

Here's a short and sweet solution using purrr and tidyr

library(tidyverse)

 list(x, y, z) %>% 
  map_df(gather, key=key, value=value, -i) %>% 
  spread(key, value)

The function eat of my package safejoin has such feature, if you give it a list of data.frames as a second input it will join them recursively to the first input.

Borrowing and extending the accepted answer's data :

x <- data_frame(i = c("a","b","c"), j = 1:3)
y <- data_frame(i = c("b","c","d"), k = 4:6)
z <- data_frame(i = c("c","d","a"), l = 7:9)
z2 <- data_frame(i = c("a","b","c"), l = rep(100L,3),l2 = rep(100L,3)) # for later

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
eat(x, list(y,z), .by = "i")
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <int>
# 1 a         1    NA     9
# 2 b         2     4    NA
# 3 c         3     5     7

We don't have to take all columns, we can use select helpers from tidyselect and choose (as we start from .x all .x columns are kept):

eat(x, list(y,z), starts_with("l") ,.by = "i")
# # A tibble: 3 x 3
#   i         j     l
#   <chr> <int> <int>
# 1 a         1     9
# 2 b         2    NA
# 3 c         3     7

or remove specific ones:

eat(x, list(y,z), -starts_with("l") ,.by = "i")
# # A tibble: 3 x 3
#   i         j     k
#   <chr> <int> <int>
# 1 a         1    NA
# 2 b         2     4
# 3 c         3     5

If the list is named the names will be used as prefixes :

eat(x, dplyr::lst(y,z), .by = "i")
# # A tibble: 3 x 4
#   i         j   y_k   z_l
#   <chr> <int> <int> <int>
# 1 a         1    NA     9
# 2 b         2     4    NA
# 3 c         3     5     7

If there are column conflicts the .conflict argument allows you to resolve it, for example by taking the first/second one, adding them, coalescing them, or nesting them.

keep first :

eat(x, list(y, z, z2), .by = "i", .conflict = ~.x)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <int>
# 1 a         1    NA     9
# 2 b         2     4    NA
# 3 c         3     5     7

keep last:

eat(x, list(y, z, z2), .by = "i", .conflict = ~.y)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <dbl>
# 1 a         1    NA   100
# 2 b         2     4   100
# 3 c         3     5   100

add:

eat(x, list(y, z, z2), .by = "i", .conflict = `+`)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <dbl>
# 1 a         1    NA   109
# 2 b         2     4    NA
# 3 c         3     5   107

coalesce:

eat(x, list(y, z, z2), .by = "i", .conflict = dplyr::coalesce)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <dbl>
# 1 a         1    NA     9
# 2 b         2     4   100
# 3 c         3     5     7

nest:

eat(x, list(y, z, z2), .by = "i", .conflict = ~tibble(first=.x, second=.y))
# # A tibble: 3 x 4
#   i         j     k l$first $second
#   <chr> <int> <int>   <int>   <int>
# 1 a         1    NA       9     100
# 2 b         2     4      NA     100
# 3 c         3     5       7     100

NA values can be replaced by using the .fill argument.

eat(x, list(y, z), .by = "i", .fill = 0)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <dbl> <dbl>
# 1 a         1     0     9
# 2 b         2     4     0
# 3 c         3     5     7

By default it's an enhanced left_join but all dplyr joins are supported through the .mode argument, fuzzy joins are also supported through the match_fun argument (it's wrapped around the package fuzzyjoin) or giving a formula such as ~ X("var1") > Y("var2") & X("var3") < Y("var4") to the by argument.