Joining data frames without returning all matching combinations

Question

I have a list of data.frames (in this example only 2):

set.seed(1)

df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)

df.list <- list(df1,df2)

I want to join them into a single data.frame only by a subset of the shared column names, in this case by id.

If I use:

library(dplyr)
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")

The shared columns that I'm not joining by get the .x and .y suffixes:

  id       val.x       val1     val.y       val2
1  G -0.05612874  0.2914462  2.087167  0.7876396
2  G -0.05612874  0.2914462 -0.255027  1.4411577
3  J -0.15579551 -0.4432919 -1.286301  1.0273924

In reality, for the shared columns that I'm not joining by, it's good enough to take them from a single data.frame in the list, whichever one contains them for the joined id.

I don't know these shared column names in advance, but that's not difficult to find out:

E.g.:

df.list.colnames <- unlist(lapply(df.list,function(l) colnames(l %>% dplyr::select(-id))))
df.list.colnames <- table(df.list.colnames)
repeating.colnames <- names(df.list.colnames)[which(df.list.colnames > 1)]

Which will then allow me to separate them from the data.frames in the list:

# select_() is deprecated; all_of() selects columns named in a character vector
repeating.colnames.df <- do.call(rbind,lapply(df.list,function(r) r %>% dplyr::select(dplyr::all_of(c("id",repeating.colnames))))) %>%
  unique()

I can then drop these columns from each data.frame in the list and join them as above:

# select_() is deprecated; -all_of() drops the columns named in a character vector
for(r in seq_along(df.list)) df.list[[r]] <- df.list[[r]] %>% dplyr::select(-dplyr::all_of(repeating.colnames))
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")

And now I'm left with adding repeating.colnames.df to that. I don't know of any join in dplyr that won't return all combinations between df and repeating.colnames.df, so it seems that all I can do is apply over each df$id, pick the first match in repeating.colnames.df, and join the result with df.
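The fallback described above (take the first match per id, then join) could be sketched roughly as follows. The toy frames joined and lookup are hypothetical stand-ins for df and repeating.colnames.df, and using distinct() with .keep_all = TRUE is only one possible reading of "pick the first match":

```r
library(dplyr)

# Hypothetical toy data: `joined` stands in for df,
# `lookup` stands in for repeating.colnames.df
joined <- data.frame(id = c("A", "B", "B"), val1 = c(1, 2, 3), stringsAsFactors = FALSE)
lookup <- data.frame(id = c("A", "A", "B"), val = c(10, 11, 20), stringsAsFactors = FALSE)

# Keep only the first row per id, so each id can match at most once
lookup_first <- lookup %>% distinct(id, .keep_all = TRUE)

# Because lookup_first has unique ids, the join cannot multiply rows of `joined`
result <- joined %>% inner_join(lookup_first, by = "id")
```

Since the lookup table is reduced to one row per id before the join, the row count of the left table is preserved for every id that has a match.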

Is there anything less cumbersome for this situation?

Chase · Accepted Answer

If I followed correctly, I think you can handle this by writing a custom function to pass into reduce that identifies the common column names (excluding your joining columns) and excludes those columns from the "second" table in the merge. As reduce works through the list, the function will "accumulate" the unique columns, defaulting to the columns in the "left-most" table.

Something like this:

library(dplyr)
library(purrr)
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)

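fun <- function(df1, df2, by_col = "id"){
  df1_names <- names(df1)
  df2_names <- names(df2)
  # shared columns other than the join key(s)
  dup_cols <- intersect(df1_names[!df1_names %in% by_col], df2_names[!df2_names %in% by_col])
  # drop them from the right-hand table so the left-most copy is kept;
  # drop = FALSE keeps a data.frame even if only one column remains
  out <- dplyr::inner_join(df1, df2[, !(df2_names %in% dup_cols), drop = FALSE], by = by_col)
  return(out)
}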

df_chase <- df.list %>% reduce(fun,by_col="id")

Created on 2019-01-15 by the reprex package (v0.2.1)

If I compare df_chase to your final solution (stored here as df_orig), I get the same answer:

> all.equal(df_chase, df_orig)
[1] TRUE
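To make the "accumulate, defaulting to the left-most table" behaviour easier to see, here is a small sketch with three data frames. The frames a, b and c3 are hypothetical toy data with unique ids, and the function is restated in a slightly shorter form (with drop = FALSE added as a safeguard) so the block runs on its own:

```r
library(dplyr)
library(purrr)

# Same idea as fun() above, restated so this block is self-contained.
# drop = FALSE is an added safeguard so a single remaining column stays a data.frame.
fun <- function(df1, df2, by_col = "id") {
  dup_cols <- setdiff(intersect(names(df1), names(df2)), by_col)
  dplyr::inner_join(df1, df2[, !(names(df2) %in% dup_cols), drop = FALSE], by = by_col)
}

# Hypothetical toy data: `val` is shared by a and b, `val2` by b and c3
a  <- data.frame(id = c("A", "B", "C"), val = c(1, 2, 3), val1 = c(10, 20, 30), stringsAsFactors = FALSE)
b  <- data.frame(id = c("A", "B"), val = c(-1, -2), val2 = c(100, 200), stringsAsFactors = FALSE)
c3 <- data.frame(id = c("B", "A"), val2 = c(-200, -100), val3 = c(7, 8), stringsAsFactors = FALSE)

# Step 1 joins a and b and keeps a's `val`;
# step 2 joins that result with c3 and keeps the `val2` that came from b
res <- list(a, b, c3) %>% reduce(fun, by_col = "id")
```

In this sketch res should have the columns id, val, val1, val2 and val3, with val taken from a and val2 taken from b, since each shared column is dropped from the right-hand table before the join.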

Tags: join, dataframe, r, dplyr, purrr
