
Join data.table on exact date or, if there is no exact match, on the nearest earlier date

Tags: r, data.table

I would like to join two data.tables using the date as the join column.

Sometimes there is no exact match, and in that case I would like to use the nearest earlier date. My problem is very similar to this post about SQL: SQL Join on Nearest less than date

I know data.table syntax is analogous to SQL, but I can't work out how to code this. What is the correct syntax?

A simplified example:

Dt1
     date     x
1/26/2010    10
1/25/2010     9
1/24/2010     9
1/22/2010     7
1/19/2010    11

Dt2
     date
1/26/2010
1/23/2010
1/20/2010

Desired output

     date     x
1/26/2010    10
1/23/2010     7
1/20/2010    11

Thank you in advance.

asked Jul 05 '12 09:07 by mat


2 Answers

Here you go:

library(data.table)

Create the data:

Dt1 <- read.table(text="
date       x
1/26/2010  10
1/25/2010  9
1/24/2010  9
1/22/2010  7
1/19/2010  11", header=TRUE, stringsAsFactors=FALSE)

Dt2 <- read.table(text="
date
1/26/2010   
1/23/2010   
1/20/2010", header=TRUE, stringsAsFactors=FALSE)

Convert to data.table, convert strings to dates, and set the data.table key:

Dt1 <- data.table(Dt1)
Dt2 <- data.table(Dt2)

Dt1[, date:=as.Date(date, format=("%m/%d/%Y"))]
Dt2[, date:=as.Date(date, format=("%m/%d/%Y"))]

setkey(Dt1, date)
setkey(Dt2, date)

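Join the tables, using roll=TRUE. Where Dt2 has a date with no exact match in Dt1, the row of Dt1 with the nearest earlier date is rolled forward and used instead: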

Dt1[Dt2, roll=TRUE]

           date  x
[1,] 2010-01-20 11
[2,] 2010-01-23  7
[3,] 2010-01-26 10
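
As a minimal sketch of the same rolling join without setting keys: this assumes a data.table version that supports the on= argument (1.9.6 or later, as far as I know). The lookup date 1/10/2010 is a hypothetical extra row, added only to show what happens when no earlier date exists in Dt1.

```r
library(data.table)

# Same data as above, built directly as data.tables (no key is set).
Dt1 <- data.table(
  date = as.Date(c("1/26/2010", "1/25/2010", "1/24/2010", "1/22/2010", "1/19/2010"),
                 format = "%m/%d/%Y"),
  x = c(10, 9, 9, 7, 11)
)

# 1/10/2010 is a hypothetical lookup date that precedes every row of Dt1.
Dt2 <- data.table(
  date = as.Date(c("1/10/2010", "1/20/2010", "1/23/2010", "1/26/2010"),
                 format = "%m/%d/%Y")
)

# For each date in Dt2, take the row of Dt1 with the same date or,
# failing that, the nearest earlier date. The join column is named in on=.
res <- Dt1[Dt2, on = "date", roll = TRUE]
print(res)
```

The result has one row per row of Dt2. The 1/10/2010 row gets x = NA, because with roll=TRUE values are only carried forward from an earlier observation and there is none before 1/19/2010.
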
answered Nov 15 '22 01:11 by Andrie


?data.table                  # search for the `roll` argument
example(data.table)          # search for the example using roll=TRUE
vignette("datatable-intro")  # see section "3: Fast time series join" 
vignette("datatable-faq")    # see FAQs 2.16 and 2.20

This is one of the main features of data.table. Because the rows of a keyed data.table are sorted by the key (unlike SQL, which is inherently unordered), this operation is simple and very fast. In SQL you need a self join and 'order by' to do this task. It can be done and it works, but it can be slow and needs more code. Since SQL is a row store, even in-memory SQL, it has a lower bound determined by page fetches from RAM into L2 cache. data.table is below that lower bound because it's a column store.

The 2 vignettes are also on the homepage.
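
To illustrate why sorted rows make this lookup simple, here is a rough base-R sketch that finds the last date at or before each lookup date with findInterval(), which does a binary search over a sorted vector. It is only an illustration of the idea, not a description of how data.table implements roll internally; the names d1, lookup, idx and result are made up for this example.

```r
# Observations sorted by date (the order a keyed table would have).
d1 <- data.frame(
  date = as.Date(c("2010-01-19", "2010-01-22", "2010-01-24",
                   "2010-01-25", "2010-01-26")),
  x = c(11, 7, 9, 9, 10)
)
lookup <- as.Date(c("2010-01-20", "2010-01-23", "2010-01-26"))

# For each lookup date, the index of the last d1$date that is <= it.
# findInterval() returns 0 when no such date exists; treat that as missing.
idx <- findInterval(as.numeric(lookup), as.numeric(d1$date))
idx[idx == 0] <- NA

result <- data.frame(date = lookup, x = d1$x[idx])
print(result)
```

Because the dates are already sorted, each lookup is a single binary search and no self join or explicit sort is needed.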

answered Nov 14 '22 23:11 by Matt Dowle