Joining data frames without returning all matching combinations

I have a list of data.frames (in this example only 2):

set.seed(1)

df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)

df.list <- list(df1,df2)

I want to join them into a single data.frame only by a subset of the shared column names, in this case by id.

If I use:

library(dplyr)
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")

The shared columns that I'm not joining by get the .x and .y suffixes appended:

  id       val.x       val1     val.y       val2
1  G -0.05612874  0.2914462  2.087167  0.7876396
2  G -0.05612874  0.2914462 -0.255027  1.4411577
3  J -0.15579551 -0.4432919 -1.286301  1.0273924

In reality, for the shared columns that I'm not joining by, it's good enough to take them from a single data.frame in the list, whichever one they exist in for the joined id.

I don't know these shared column names in advance, but they're not difficult to find:

E.g.:

df.list.colnames <- unlist(lapply(df.list,function(l) colnames(l %>% dplyr::select(-id))))
df.list.colnames <- table(df.list.colnames)
repeating.colnames <- names(df.list.colnames)[which(df.list.colnames > 1)]

Which will then allow me to separate them from the data.frames in the list:

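# select_() is deprecated in current dplyr; select() with all_of() takes the same character vector
repeating.colnames.df <- do.call(rbind,lapply(df.list,function(r) r %>% dplyr::select(dplyr::all_of(c("id",repeating.colnames))))) %>%
  unique()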

I can then drop these columns from each data.frame in the list and join the list as above:

# select_() is deprecated in current dplyr; -all_of() drops the named columns
for(r in seq_along(df.list)) df.list[[r]] <- df.list[[r]] %>% dplyr::select(-dplyr::all_of(repeating.colnames))
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")

Now I'm left with adding repeating.colnames.df to that. I don't know of any join in dplyr that won't return all combinations between df and repeating.colnames.df, so it seems that all I can do is loop over each df$id, pick the first match in repeating.colnames.df, and join the result with df.

Is there anything less cumbersome for this situation?

asked May 21 '26 20:05 by dan


1 Answer

If I followed correctly, I think you can handle this by writing a custom function to pass into reduce that identifies the common column names (excluding your joining columns) and excludes those columns from the "second" table in the merge. As reduce works through the list, the function will "accumulate" the unique columns, defaulting to the columns in the "left-most" table.

Something like this:

library(dplyr)
library(purrr)
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)

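# Join two data frames on by_col, keeping shared non-key columns from df1 only
fun <- function(df1, df2, by_col = "id"){
  df1_names <- names(df1)
  df2_names <- names(df2)
  # columns present in both tables, other than the join key(s)
  dup_cols <- intersect(df1_names[!df1_names %in% by_col], df2_names[!df2_names %in% by_col])
  # drop = FALSE keeps df2 a data.frame even if only the key column remains
  out <- dplyr::inner_join(df1, df2[, !(df2_names %in% dup_cols), drop = FALSE], by = by_col)
  return(out)
}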

df_chase <- df.list %>% reduce(fun,by_col="id")

Created on 2019-01-15 by the reprex package (v0.2.1)
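
To make the "accumulating" behaviour easier to see, here is a minimal sketch on a small hand-built list of three tables. The helper name `join_keep_left` and the toy data are made up for illustration; the helper follows the same idea as `fun` above, with `drop = FALSE` added as a guard so that a single remaining column stays a data.frame.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: inner join that keeps shared non-key columns from the left table only
join_keep_left <- function(x, y, by_col = "id") {
  # columns present in both tables, other than the join key(s)
  dup_cols <- setdiff(intersect(names(x), names(y)), by_col)
  dplyr::inner_join(x, y[, !(names(y) %in% dup_cols), drop = FALSE], by = by_col)
}

# Toy data (made up): `val` is shared by a and b, `val2` is shared by b and c3
a  <- data.frame(id = c("A", "B"), val = c(1, 2),   val1 = c(10, 20),   stringsAsFactors = FALSE)
b  <- data.frame(id = c("A", "B"), val = c(-1, -2), val2 = c(100, 200), stringsAsFactors = FALSE)
c3 <- data.frame(id = c("A", "B"), val2 = c(0, 0),  val3 = c(7, 8),     stringsAsFactors = FALSE)

res <- purrr::reduce(list(a, b, c3), join_keep_left)

names(res)  # "id" "val" "val1" "val2" "val3"
res$val     # 1 2     (taken from a, the left-most table that has `val`)
res$val2    # 100 200 (taken from b, the first table that has `val2`)
```

No `.x`/`.y` suffixes appear, because each shared column is dropped from the right-hand table before the join. Note that rows are still matched by a regular inner join, so ids that repeat on both sides will still produce every matching combination.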

If I compare df_chase to the result of your approach (df_orig, not shown here), I get the same answer:

> all.equal(df_chase, df_orig)
[1] TRUE
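
For comparison, the "pick the first match per id" step described in the question could probably be done without a loop. The sketch below is an alternative, not the approach used above: it assumes that taking the first row per id is acceptable, and it uses `dplyr::distinct()` with `.keep_all = TRUE` to reduce the shared-column table to one row per id before joining. The data frames `joined` and `shared` are made-up stand-ins for `df` and `repeating.colnames.df`.

```r
library(dplyr)

# Made-up stand-in for the already-joined table (df in the question)
joined <- data.frame(id = c("G", "G", "J"),
                     val1 = c(0.3, 0.3, -0.4),
                     val2 = c(0.8, 1.4, 1.0),
                     stringsAsFactors = FALSE)

# Made-up stand-in for repeating.colnames.df: several candidate rows per id
shared <- data.frame(id = c("G", "G", "J", "K"),
                     val = c(-0.06, 2.09, -0.16, 0.5),
                     stringsAsFactors = FALSE)

# Keep only the first row for each id, so the join cannot multiply rows
first_match <- shared %>% dplyr::distinct(id, .keep_all = TRUE)

res <- dplyr::inner_join(joined, first_match, by = "id")

nrow(res) == nrow(joined)  # TRUE: one match per row of `joined`
res$val                    # -0.06 -0.06 -0.16
```

Because `first_match` has at most one row per id, the join adds the shared column without adding rows.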
answered May 24 '26 15:05 by Chase


