Is there an efficient way to do outer joins with data.table? For example:
a <- data.table(a=c(1,2,3),b=c(3,4,5))
b <- data.table(a=c(1,2),k=c(1,2))
merge(a,b,by="a",all.x=TRUE)
This works, but on bigger data it is really slow compared with the keyed join below, which runs very fast:
setkey(a,a)
setkey(b,a)
a[b,]
To perform a left or full outer join on ordinary data frames, you can use the base merge() function or dplyr's full_join(); to join more than two data frames, you can fold a join over a list with purrr's reduce() from the tidyverse. dplyr provides a family of join functions, and they are often faster than base merge() on large data frames, though this depends on the data.
If you want to join by multiple variables, specify a vector of variable names: by = c("var1", "var2", "var3"). All three columns must then match in both tables. If you want to join on every variable that appears in both tables, you can leave the by argument out.
If the columns you want to join by don't have the same name, you need to tell merge which columns to use: by.x for the column name in the x data frame and by.y for the one in the y data frame, such as merge(df1, df2, by.x = "df1ColName", by.y = ...).
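A minimal sketch of the multi-column case using base merge(). The data frames and the column names var1, var2, left_val and right_val are made up for illustration.

```r
# Hypothetical data frames that share two key columns, var1 and var2.
df1 <- data.frame(var1 = c(1, 1, 2), var2 = c("x", "y", "x"), left_val = c(10, 20, 30))
df2 <- data.frame(var1 = c(1, 2, 2), var2 = c("x", "x", "y"), right_val = c(100, 200, 300))

# Explicit: a row matches only when both var1 and var2 agree.
explicit <- merge(df1, df2, by = c("var1", "var2"))

# Implicit: with `by` omitted, merge() joins on the column names common to both.
implicit <- merge(df1, df2)
```

Both calls join on the same two columns here, because var1 and var2 are the only names the two data frames have in common.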
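A small sketch of joining on differently named columns. The name "df2ColName" is an assumption made for illustration, since the example above is cut off before the by.y value; the data mirror the tables in the question.

```r
# Hypothetical data frames whose key columns have different names.
df1 <- data.frame(df1ColName = c(1, 2, 3), b = c(3, 4, 5))
df2 <- data.frame(df2ColName = c(1, 2), k = c(1, 2))  # "df2ColName" is an assumed name

# Left outer join: keep every row of df1, fill k with NA where df2 has no match.
joined <- merge(df1, df2, by.x = "df1ColName", by.y = "df2ColName", all.x = TRUE)
```

The key column in the result takes its name from the x side, so it appears as df1ColName.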
b[a,]
is the "outer join" you're looking for: with both tables keyed on a, it keeps every row of a and fills k with NA where b has no match.
Take a look at ?merge.data.table for more specifics.
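To make the comparison concrete, here is a short sketch using the two tables from the question. It assumes the data.table package is installed; the result names (inner, left_keyed, left_merge, full) are only for illustration.

```r
library(data.table)

a <- data.table(a = c(1, 2, 3), b = c(3, 4, 5))
b <- data.table(a = c(1, 2), k = c(1, 2))
setkey(a, a)
setkey(b, a)

# Inner join: only keys present in both tables (nomatch = 0L drops non-matches).
inner <- a[b, nomatch = 0L]

# Left outer join on a: look up each row of a in b; unmatched rows get NA in k.
left_keyed <- b[a]

# The same join via merge(), as in the question.
left_merge <- merge(a, b, by = "a", all.x = TRUE)

# Full outer join: keep unmatched rows from both sides.
full <- merge(a, b, by = "a", all = TRUE)
```

The keyed form and the merge() form return the same rows; only the column order differs (a, k, b for b[a] versus a, b, k for merge()). Note that a[b] without nomatch keeps every row of b, so it only coincides with an inner join when every key in b also occurs in a, as it does here.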