Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to join a data frame to itself within a dplyr chain?

Tags:

r

dplyr

magrittr

Occasionally, I need to join a data frame to (usually a modified) version of itself within a dplyr chain. Something like this:

df  <- data.frame(
     id = c(1,2,3)
   , status = c('foo','bar','meh')
   , spouseid = c(4,3,2)
)


df %>% 
  filter( status == 'foo' | status == 'bar') %>% 
  # join the filtered table to itself using the dot as the right-hand side
  left_join(., by = c('id' = 'spouseid'))

When I try that, I get Error in is.data.frame(y) : argument "y" is missing, with no default.

like image 822
crazybilly Avatar asked Sep 14 '16 16:09

crazybilly


People also ask

How do I join a Dataframe in dplyr in R?

Joins with dplyr. dplyr uses SQL database syntax for its join functions. A left join means: Include everything on the left (what was the x data frame in merge() ) and all rows that match from the right (y) data frame. If the join columns have the same name, all you need is left_join(x, y) .

How do I inner join a Dataframe in R?

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by. x and by.

Does dplyr work with data frame?

All of the dplyr functions take a data frame (or tibble) as the first argument. Rather than forcing the user to either save intermediate objects or nest functions, dplyr provides the %>% operator from magrittr. x %>% f(y) turns into f(x, y) so the result from one step is then “piped” into the next step.

What is a self join in R?

A self-join, also known as an inner join, is a structured query language (SQL) statement where a queried table is joined to itself. The self-join statement is necessary when two sets of data, within the same table, are compared.


1 Answers

The problem is that using the dot just moves around the left hand side, so the way it's written above only passes the lhs into left_join(). To use the dot for both the left- and right-hand sides, use the dot twice:

df %>% 
  filter( status == 'foo' | status == 'bar') %>% 
  # the first dot is x argument and the second dot is the y argument
  left_join(
      x = . 
    , y = . 
    , by = c('id' = 'spouseid')
  )

This way, you're passing the lhs to both arguments of left_join() rather than relying on magrittr's implicit lhs like you normally would.

like image 77
crazybilly Avatar answered Oct 19 '22 20:10

crazybilly