Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to join (merge) data frames (inner, outer, left, right)

Tags:

r

r-faq

Given two data frames:

df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3))) df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))  df1 #  CustomerId Product #           1 Toaster #           2 Toaster #           3 Toaster #           4   Radio #           5   Radio #           6   Radio  df2 #  CustomerId   State #           2 Alabama #           4 Alabama #           6    Ohio 

How can I do database style, i.e., sql style, joins? That is, how do I get:

  • An inner join of df1 and df2:
    Return only the rows in which the left table have matching keys in the right table.
  • An outer join of df1 and df2:
    Returns all rows from both tables, join records from the left which have matching keys in the right table.
  • A left outer join (or simply left join) of df1 and df2
    Return all rows from the left table, and any rows with matching keys from the right table.
  • A right outer join of df1 and df2
    Return all rows from the right table, and any rows with matching keys from the left table.

Extra credit:

How can I do a SQL style select statement?

like image 634
Dan Goldstein Avatar asked Aug 19 '09 13:08

Dan Goldstein


People also ask

Can we use left outer join and inner join together?

If they are at the same 'scope', the INNERS will always run, you can't conditionally have them run based on their order. You either have to use two LEFT joins and filter out the records you don't want, or alternatively use a View to scope the INNER JOIN.

How do I join two data frames in R?

In R we use merge() function to merge two dataframes in R. This function is present inside join() function of dplyr package. The most important condition for joining two dataframes is that the column type should be the same on which the merging happens. merge() function works similarly like join in DBMS.

How do I inner join in R?

We can use the merge() function in base R to perform an inner join, using the 'team' column as the column to join on: What is this? Notice that only the teams that were in both datasets are kept in the final dataset.


1 Answers

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)

like image 160
Matt Parker Avatar answered Oct 13 '22 03:10

Matt Parker