dplyr left_join by less than, greater than condition

This question is somewhat related to the questions "Efficiently merging two data frames on a non-trivial criteria" and "Checking if date is between two dates in r", and to a GitHub issue I posted asking whether this feature exists.

I am looking to join two data frames using dplyr::left_join(). The join condition uses less-than and greater-than comparisons, i.e., <= and >. Does dplyr::left_join() support this, or can the keys only be compared with the = operator? This is straightforward to do in SQL (assuming I have the data frames in a database).

Here is a MWE. I have two datasets: one is firm-year data (fdata), and the second (sdata) is a sort of survey data collected once every five years. For every year in fdata that falls between two survey years, I want to join the corresponding survey-year data.

    id <- c(1,1,1,1,
            2,2,2,2,2,2,
            3,3,3,3,3,3,
            5,5,5,5,
            8,8,8,8,
            13,13,13)

    fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
               1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
               1998,1999,2000,2001,1998,1999,2000)

    byear <- c(1990,1995,2000,2005)
    eyear <- c(1995,2000,2005,2010)
    val <- c(3,1,5,6)

    sdata <- tbl_df(data.frame(byear, eyear, val))

    fdata <- tbl_df(data.frame(id, fyear))

    test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))

I get

Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds

Or can left_join handle the condition, and my syntax is just missing something?

730

asked May 18 '16 02:05

rajvijay

2 Answers

data.table adds non-equi joins starting from v 1.9.8

    library(data.table) #v>=1.9.8
    setDT(sdata); setDT(fdata) # converting to data.table in place

    fdata[sdata, on = .(fyear >= byear, fyear < eyear), nomatch = 0,
          .(id, x.fyear, byear, eyear, val)]
    #    id x.fyear byear eyear val
    # 1:  1    1998  1995  2000   1
    # 2:  2    1998  1995  2000   1
    # 3:  3    1998  1995  2000   1
    # 4:  5    1998  1995  2000   1
    # 5:  8    1998  1995  2000   1
    # 6: 13    1998  1995  2000   1
    # 7:  1    1999  1995  2000   1
    # 8:  2    1999  1995  2000   1
    # 9:  3    1999  1995  2000   1
    #10:  5    1999  1995  2000   1
    #11:  8    1999  1995  2000   1
    #12: 13    1999  1995  2000   1
    #13:  1    2000  2000  2005   5
    #14:  2    2000  2000  2005   5
    #15:  3    2000  2000  2005   5
    #16:  5    2000  2000  2005   5
    #17:  8    2000  2000  2005   5
    #18: 13    2000  2000  2005   5
    #19:  1    2001  2000  2005   5
    #20:  2    2001  2000  2005   5
    #21:  3    2001  2000  2005   5
    #22:  5    2001  2000  2005   5
    #23:  8    2001  2000  2005   5
    #24:  2    2002  2000  2005   5
    #25:  3    2002  2000  2005   5
    #26:  2    2003  2000  2005   5
    #27:  3    2003  2000  2005   5
    #    id x.fyear byear eyear val

You can also get this to work with foverlaps in 1.9.6 with a little more effort.
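For anyone stuck on an older data.table, here is a rough sketch of what the foverlaps route might look like. It is not taken from the answer above. The helper columns elast, fstart and fend are made up for illustration, and the year columns are stored as integers here (the MWE uses doubles) so that the eyear - 1 trick is exact. The catch is that foverlaps() works with closed intervals, so the half-open window [byear, eyear) has to be turned into [byear, eyear - 1].

```r
library(data.table)

# Data from the question's MWE, with the year columns as integers (assumption)
fdata <- data.table(
  id    = c(1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3, 5,5,5,5, 8,8,8,8, 13,13,13),
  fyear = c(1998:2001, 1998:2003, 1998:2003, 1998:2001, 1998:2001, 1998:2000)
)
sdata <- data.table(
  byear = c(1990L, 1995L, 2000L, 2005L),
  eyear = c(1995L, 2000L, 2005L, 2010L),
  val   = c(3, 1, 5, 6)
)

# foverlaps() treats both interval ends as inclusive, so [byear, eyear)
# becomes [byear, eyear - 1] for whole years (hypothetical helper column).
sdata[, elast := eyear - 1L]

# Each firm-year becomes a zero-width interval [fyear, fyear]
# (hypothetical helper columns).
fdata[, `:=`(fstart = fyear, fend = fyear)]

# The second table must be keyed on its interval columns.
setkey(sdata, byear, elast)

res <- foverlaps(fdata, sdata,
                 by.x = c("fstart", "fend"),
                 by.y = c("byear", "elast"),
                 type = "within",
                 nomatch = 0L)   # drop unmatched rows (inner join)

# Drop the helper columns again.
res[, c("elast", "fstart", "fend") := NULL]
```

With nomatch = 0L this behaves like the inner join shown above; leaving nomatch at its default of NA should keep unmatched firm-years with missing survey columns.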

180

answered Sep 28 '22 01:09

eddi

Use a filter. (Note that this answer does not produce a true LEFT JOIN: the filter drops any rows of fdata that have no match, so the result is effectively an INNER JOIN. For the MWE the two give the same result.)

The dplyr package isn't happy if asked to merge two tables without something to merge on, so in the following I create a dummy variable in both tables for this purpose, then filter, then drop the dummy:

    fdata %>%
        mutate(dummy=TRUE) %>%
        left_join(sdata %>% mutate(dummy=TRUE)) %>%
        filter(fyear >= byear, fyear < eyear) %>%
        select(-dummy)

And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy variable as evidenced by the following two query explanations:

    > fdata %>%
    +     mutate(dummy=TRUE) %>%
    +     left_join(sdata %>% mutate(dummy=TRUE)) %>%
    +     filter(fyear >= byear, fyear < eyear) %>%
    +     select(-dummy) %>%
    +     explain()
    Joining by: "dummy"
    <SQL>
    SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
    FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
    FROM "fdata") AS "zzz136"

    LEFT JOIN

    (SELECT "byear", "eyear", "val", TRUE AS "dummy"
    FROM "sdata") AS "zzz137"

    USING ("dummy")) AS "zzz138"
    WHERE "fyear" >= "byear" AND "fyear" < "eyear"

    <PLAN>
    Nested Loop  (cost=0.00..50886.88 rows=322722 width=40)
      Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
      ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
      ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
            ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)

Doing it more cleanly with SQL gives the same join filter and cost estimates; the only difference in the plan is that the nested loop is now a true left join:

    > tbl(pg, sql("
    +     SELECT *
    +     FROM fdata
    +     LEFT JOIN sdata
    +     ON fyear >= byear AND fyear < eyear")) %>%
    +     explain()
    <SQL>
    SELECT "id", "fyear", "byear", "eyear", "val"
    FROM (
        SELECT *
        FROM fdata
        LEFT JOIN sdata
        ON fyear >= byear AND fyear < eyear) AS "zzz140"

    <PLAN>
    Nested Loop Left Join  (cost=0.00..50886.88 rows=322722 width=40)
      Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
      ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
      ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
            ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)
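If the in-memory result needs to be a real left join, so that firm-years without a survey window are kept, newer dplyr may make the dummy column unnecessary. As far as I know, dplyr 1.1.0 and later accept inequality conditions through join_by(). The sketch below assumes that version. The extra row with id 21 and fyear 2012 is hypothetical and not part of the question's data; it is only there to show an unmatched row surviving the join.

```r
library(dplyr)  # join_by() is assumed to be available (dplyr >= 1.1.0)

# Data from the question's MWE
fdata <- tibble(
  id    = c(1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3, 5,5,5,5, 8,8,8,8, 13,13,13),
  fyear = c(1998,1999,2000,2001, 1998,1999,2000,2001,2002,2003,
            1998,1999,2000,2001,2002,2003, 1998,1999,2000,2001,
            1998,1999,2000,2001, 1998,1999,2000)
)
sdata <- tibble(
  byear = c(1990, 1995, 2000, 2005),
  eyear = c(1995, 2000, 2005, 2010),
  val   = c(3, 1, 5, 6)
)

# Hypothetical firm-year outside every survey window (not in the question).
fdata2 <- bind_rows(fdata, tibble(id = 21, fyear = 2012))

# In join_by(), the left side of each comparison refers to the first table
# and the right side to the second.
test1 <- left_join(fdata2, sdata,
                   by = join_by(fyear >= byear, fyear < eyear))

# The unmatched row is kept, with NA in the survey columns.
filter(test1, is.na(val))
```

Unlike the filter approach, nothing is dropped after the join here, so the row count of the left table is preserved whenever each firm-year matches at most one survey window.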

answered Sep 28 '22 00:09

Ian Gow

Tags: sql, r, r-faq, left-join, dplyr
