How to join 2 tables using *_join() from dplyr, when the join variable names are different and stored in another variable?
e.g.
df1 = data_frame(x1 = 1:10,y1 = 21:30)
df2 = data_frame(x2 = 6:15,y2 = 26:35)
df3 = data_frame(x1 = 6:15,y2 = 26:35)
var1 = "x1"
var2 = "x2"
df1 %>% left_join(df3,by=c(var1)) # #1 works
but this gives error -
df1 %>% left_join(df2,by=c(var1 = var2)) # #2 doesn't work
Error: cannot join on columns 'x2' x 'var1': index out of bounds
surprisingly, this works -
df1 %>% left_join(df2,by=c("x1" = var2)) # #3 works
The problem here is that you have to supply a named vector if the common column has different names in the data.frames. See what happens in your example:
It works when you supply the names directly:
df1 %>% left_join(df2, by = c("x1" = "x2"))
#Source: local data frame [10 x 3]
#
# x1 y1 y2
#1 1 21 NA
#2 2 22 NA
#3 3 23 NA
#4 4 24 NA
#5 5 25 NA
#6 6 26 26
#7 7 27 27
#8 8 28 28
#9 9 29 29
#10 10 30 30
And the named vector you supply is:
c("x1" = "x2")
# x1
#"x2"
Now if you use character vectors, the named vector changes to:
var1 = "x1"
var2 = "x2"
c(var1 = var2)
#var1 # <~~ this is why it doesn't work
#"x2"
I don't know if there's a "clean" way to solve this in dplyr currently. A workaround is the following tweak to construct the named vector as required:
df1 %>% left_join(df2, by = setNames(var2, var1))
#Source: local data frame [10 x 3]
#
# x1 y1 y2
#1 1 21 NA
#2 2 22 NA
#3 3 23 NA
#4 4 24 NA
#5 5 25 NA
#6 6 26 26
#7 7 27 27
#8 8 28 28
#9 9 29 29
#10 10 30 30
This works because:
setNames(var2, var1)
# x1
#"x2"
Hope it helps.
Note: you could do the same with names<-
like so:
df1 %>% left_join(df2, by = `names<-`(var2, var1))
but Hadley recommends using the setNames
approach instead.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With