Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

merge two dataframe based on matching two exchangable columns in each dataframe

I have two dataframe in R.

dataframe 1

A B C D E F G
1 2 a a a a a
2 3 b b b c c
4 1 e e f f e

dataframe 2

X Y Z
1 2 g
2 1 h
3 4 i
1 4 j

I want to match dataframe1's column A and B with dataframe2's column X and Y. It is NOT a pairwise comparsions, i.e. row 1 (A=1 B=2) are considered to be same as row 1 (X=1, Y=2) and row 2 (X=2, Y=1) of dataframe 2.

When matching can be found, I would like to add columns C, D, E, F of dataframe1 back to the matched row of dataframe2, as follows: with no matching as na.

Final dataframe

X Y Z C  D  E  F  G
1 2 g a  a  a  a  a 
2 1 h a  a  a  a  a
3 4 i na na na na na
1 4 j e  e  f  f  e

I can only know how to do matching for single column, however, how to do matching for two exchangable columns and merging two dataframes based on the matching results is difficult for me. Pls kindly help to offer smart way of doing this.

For the ease of discussion (thanks for the comments by Vincent and DWin (my previous quesiton) that I should test the quote.) There are the quota for loading dataframe 1 and 2 to R.

df1 <- data.frame(A = c(1,2,4), B=c(2,3,1), C=c('a','b','e'), 
                                D=c('a','b','e'), E=c('a','b','f'), 
                                F=c('a','c','f'), G=c('a','c', 'e'))

df2  <- data.frame(X = c(1,2,3,1), Y=c(2,1,4,4), Z=letters[7:10])
like image 805
a83 Avatar asked May 25 '11 05:05

a83


1 Answers

The following works, but no doubt can be improved.

I first create a little helper function that performs a row-wise sort on A and B (and renames it to V1 and V2).

replace_index <- function(dat){
  x <- as.data.frame(t(sapply(seq_len(nrow(dat)), 
    function(i)sort(unlist(dat[i, 1:2])))))
  names(x) <- paste("V", seq_len(ncol(x)), sep="")
  data.frame(x, dat[, -(1:2), drop=FALSE])
} 

replace_index(df1)

  V1 V2 C D E F G
1  1  2 a a a a a
2  2  3 b b b c c
3  1  4 e e f f e

This means you can use a straight-forward merge to combine the data.

merge(replace_index(df1), replace_index(df2), all.y=TRUE)

  V1 V2    C    D    E    F    G Z
1  1  2    a    a    a    a    a g
2  1  2    a    a    a    a    a h
3  1  4    e    e    f    f    e j
4  3  4 <NA> <NA> <NA> <NA> <NA> i
like image 90
Andrie Avatar answered Oct 14 '22 06:10

Andrie