Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr left_join by less than, greater than condition

This question is somewhat related to issues Efficiently merging two data frames on a non-trivial criteria and Checking if date is between two dates in r. And the one I have posted here requesting if the feature exist: GitHub issue

I am looking to join two dataframes using dplyr::left_join(). The condition I use to join is less-than, greater-than i.e, <= and >. Does dplyr::left_join() support this feature? or do the keys only take = operator between them. This is straightforward to run from SQL (assuming I have the dataframe in the database)

Here is a MWE: I have two datasets one firm-year (fdata), while second is sort of survey data that happens once every five years. So for all years in the fdata that are in between two survey years, I join the corresponding survey year data.

id <- c(1,1,1,1,         2,2,2,2,2,2,         3,3,3,3,3,3,         5,5,5,5,         8,8,8,8,         13,13,13)  fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,        1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,        1998,1999,2000,2001,1998,1999,2000)  byear <- c(1990,1995,2000,2005) eyear <- c(1995,2000,2005,2010) val <- c(3,1,5,6)  sdata <- tbl_df(data.frame(byear, eyear, val))  fdata <- tbl_df(data.frame(id, fyear))  test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear")) 

I get

Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds  

Unless if left_join can handle the condition, but my syntax is missing something?

like image 730
rajvijay Avatar asked May 18 '16 02:05

rajvijay


People also ask

What is the difference between left_join and right_join in dplyr?

Figure 3: dplyr left_join Function. The difference to the inner_join function is that left_join retains all rows of the data table, which is inserted first into the function (i.e. the X-data). Have a look at the R documentation for a precise definition: Example 3: right_join dplyr R Function. Right join is the reversed brother of left join:

How to specify more than one condition in dplyr filter ()?

With dplyr’s filter () function, we can also specify more than one conditions. In the example below, we have two conditions inside filter () function, one specifies flipper length greater than 220 and second condition for sex column. We can filter dataframe for rows satisfying one of the two conditions using Boolean OR.

How does anti join work in dplyr?

Figure 7: dplyr anti_join Function. As you can see, the anti_join functions keeps only rows that are non-existent in the right-hand data AND keeps only columns of the left-hand data. The R help documentation of anti join is shown below: At this point you have learned the basic principles of the six dplyr join functions.

How to merge data based on inner_join in dplyr?

In order to merge our data based on inner_join, we simply have to specify the names of our two data frames (i.e. data1 and data2) and the column based on which we want to merge (i.e. the column ID ): Figure 2: dplyr inner_join Function. Figure 2 illustrates the output of the inner join that we have just performed.


2 Answers

data.table adds non-equi joins starting from v 1.9.8

library(data.table) #v>=1.9.8 setDT(sdata); setDT(fdata) # converting to data.table in place  fdata[sdata, on = .(fyear >= byear, fyear < eyear), nomatch = 0,       .(id, x.fyear, byear, eyear, val)] #    id x.fyear byear eyear val # 1:  1    1998  1995  2000   1 # 2:  2    1998  1995  2000   1 # 3:  3    1998  1995  2000   1 # 4:  5    1998  1995  2000   1 # 5:  8    1998  1995  2000   1 # 6: 13    1998  1995  2000   1 # 7:  1    1999  1995  2000   1 # 8:  2    1999  1995  2000   1 # 9:  3    1999  1995  2000   1 #10:  5    1999  1995  2000   1 #11:  8    1999  1995  2000   1 #12: 13    1999  1995  2000   1 #13:  1    2000  2000  2005   5 #14:  2    2000  2000  2005   5 #15:  3    2000  2000  2005   5 #16:  5    2000  2000  2005   5 #17:  8    2000  2000  2005   5 #18: 13    2000  2000  2005   5 #19:  1    2001  2000  2005   5 #20:  2    2001  2000  2005   5 #21:  3    2001  2000  2005   5 #22:  5    2001  2000  2005   5 #23:  8    2001  2000  2005   5 #24:  2    2002  2000  2005   5 #25:  3    2002  2000  2005   5 #26:  2    2003  2000  2005   5 #27:  3    2003  2000  2005   5 #    id x.fyear byear eyear val 

You can also get this to work with foverlaps in 1.9.6 with a little more effort.

like image 180
eddi Avatar answered Sep 28 '22 01:09

eddi


Use a filter. (But note that this answer does not produce a correct LEFT JOIN; but the MWE gives the right result with an INNER JOIN instead.)

The dplyr package isn't happy if asked merge two tables without something to merge on, so in the following, I make a dummy variable in both tables for this purpose, then filter, then drop dummy:

fdata %>%      mutate(dummy=TRUE) %>%     left_join(sdata %>% mutate(dummy=TRUE)) %>%     filter(fyear >= byear, fyear < eyear) %>%     select(-dummy) 

And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy variable as evidenced by the following two query explanations:

> fdata %>%  +     mutate(dummy=TRUE) %>% +     left_join(sdata %>% mutate(dummy=TRUE)) %>% +     filter(fyear >= byear, fyear < eyear) %>% +     select(-dummy) %>% +     explain() Joining by: "dummy" <SQL> SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val" FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy" FROM "fdata") AS "zzz136"  LEFT JOIN   (SELECT "byear", "eyear", "val", TRUE AS "dummy" FROM "sdata") AS "zzz137"  USING ("dummy")) AS "zzz138" WHERE "fyear" >= "byear" AND "fyear" < "eyear"   <PLAN> Nested Loop  (cost=0.00..50886.88 rows=322722 width=40)   Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))   ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)   ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)         ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24) 

and doing it more cleanly with SQL gives exactly the same result:

> tbl(pg, sql(" +     SELECT * +     FROM fdata  +     LEFT JOIN sdata  +     ON fyear >= byear AND fyear < eyear")) %>% +     explain() <SQL> SELECT "id", "fyear", "byear", "eyear", "val" FROM (     SELECT *     FROM fdata      LEFT JOIN sdata      ON fyear >= byear AND fyear < eyear) AS "zzz140"   <PLAN> Nested Loop Left Join  (cost=0.00..50886.88 rows=322722 width=40)   Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))   ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)   ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)         ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24) 
like image 22
Ian Gow Avatar answered Sep 28 '22 00:09

Ian Gow