I have two R data frames I want to merge. In base R you can do:
cost <- data.frame(farm=c('farm A', 'office'), cost=c(10, 100))
trees <- data.frame(farm=c('farm A', 'farm B'), trees=c(20,30))
merge(cost, trees, all=TRUE)
which produces:
    farm cost trees
1 farm A   10    20
2 office  100    NA
3 farm B   NA    30
I am using dplyr, and would prefer a solution such as:
left_join(cost, trees)
which produces something close to what I want:
    farm cost trees
1 farm A   10    20
2 office  100    NA
In dplyr I can see left_join, inner_join, semi_join and anti_join, but none of these does what merge with all=TRUE does.
Also, is there a quick way to set the NAs to 0? My efforts so far using x$trees[is.na(x$trees)] <- 0 are laborious (I need a command per column) and don't always seem to work.
thanks
dplyr handles four types of joins similar to SQL:
left_join(): merges two datasets and keeps all observations from the left (first) table.
right_join(): merges two datasets and keeps all observations from the right (second) table.
inner_join(): merges two datasets and excludes all unmatched rows.
full_join(): merges two datasets and keeps all observations from both tables.
The join functions from dplyr preserve the original row order of the first data frame, while merge() by default (sort = TRUE) sorts the result on the column(s) used for the join.
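As a small illustration of the ordering difference, here is a minimal sketch. The data frames `x` and `y` and their column names are made up for this example, with a key column that is deliberately not in alphabetical order. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data: the key column is deliberately out of alphabetical order.
x <- data.frame(key = c("b", "a"), left_val = c(1, 2), stringsAsFactors = FALSE)
y <- data.frame(key = c("a", "b"), right_val = c(10, 20), stringsAsFactors = FALSE)

merged <- merge(x, y, by = "key")      # base R: sort = TRUE by default
joined <- left_join(x, y, by = "key")  # dplyr: keeps the row order of x

merged$key  # "a" "b"
joined$key  # "b" "a"
```

If the sorted order from merge() is not wanted, passing sort = FALSE turns the sorting off, although the documentation then describes the row order as unspecified.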
Full join: a full outer join returns all records from both tables, whether or not they have a match in the other table. Where rows match, they are combined into one row; where they do not, the columns from the other table are filled with NA (the R counterpart of NULL in SQL).
To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. If data frame A has variables that data frame B does not, then either: Delete the extra variables in data frame A or.
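A minimal sketch of vertical stacking with rbind, using two made-up data frames (`a` and `b` are hypothetical names) that hold the same variables in a different column order:

```r
# Hypothetical data frames with the same variables in a different column order.
a <- data.frame(id = 1:2, name = c("x", "y"), stringsAsFactors = FALSE)
b <- data.frame(name = "z", id = 3L, stringsAsFactors = FALSE)

# rbind matches the columns of b to those of a by name, not by position.
stacked <- rbind(a, b)

stacked$id    # 1 2 3
stacked$name  # "x" "y" "z"
```

The result keeps the column order of the first data frame.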
The most recent version of dplyr (0.4.0) now has a full_join option, which is what I believe you want.
cost <- data.frame(farm=c('farm A', 'office'), cost=c(10, 100))
trees <- data.frame(farm=c('farm A', 'farm B'), trees=c(20,30))
merge(cost, trees, all=TRUE)
Returns
> merge(cost, trees, all=TRUE)
    farm cost trees
1 farm A   10    20
2 office  100    NA
3 farm B   NA    30
And
library(dplyr)
full_join(cost, trees)
Returns
> full_join(cost, trees)
Joining by: "farm"
    farm cost trees
1 farm A   10    20
2 office  100    NA
3 farm B   NA    30
Warning message:
joining factors with different levels, coercing to character vector
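On the second part of the question (setting the NAs to 0 without one command per column), here is a hedged sketch of two possible approaches. It is not from the original answer. It assumes the `farm` columns are created as character (stringsAsFactors = FALSE), which also avoids the factor warning above, and the dplyr variant assumes dplyr 1.0.0 or later for across(). The names `combined`, `filled_base` and `filled_dplyr` are only for illustration.

```r
library(dplyr)

cost  <- data.frame(farm = c("farm A", "office"), cost = c(10, 100),
                    stringsAsFactors = FALSE)
trees <- data.frame(farm = c("farm A", "farm B"), trees = c(20, 30),
                    stringsAsFactors = FALSE)

combined <- full_join(cost, trees, by = "farm")

# Base R: replace every NA in the data frame in one step.
# Caveat: this touches all columns, including non-numeric ones.
filled_base <- combined
filled_base[is.na(filled_base)] <- 0

# dplyr (assumes >= 1.0.0): replace NAs in numeric columns only.
filled_dplyr <- combined %>%
  mutate(across(where(is.numeric), ~ coalesce(.x, 0)))

filled_dplyr$cost   # 10 100 0
filled_dplyr$trees  # 20 0 30
```

The dplyr variant is the more conservative of the two, since it leaves character or factor columns alone.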