rolling joins data.table in R

I am trying to understand how rolling joins work and am confused; I was hoping somebody could clarify this for me. To take a concrete example:

dt1 <- data.table(id=rep(1:5, 10), t=1:50, val1=1:50, key="id,t")
dt2 <- data.table(id=rep(1:5, 2), t=1:10, val2=1:10, key="id,t")

I expected this to produce a long data.table where the values in dt2 are rolled:

dt1[dt2,roll=TRUE]

Instead, the correct way to do this seems to be:

dt2[dt1,roll=TRUE]

Could someone explain how joining in data.table works? I am clearly not understanding it correctly. I thought that dt1[dt2,roll=TRUE] corresponded to the SQL equivalent of select * from dt1 right join dt2 on (dt1.id = dt2.id and dt1.t = dt2.t), except with the added LOCF functionality.

Additionally the documentation says:

X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) 
as an index.

This makes it seem that only rows in X should be returned and that the join being done is an inner join, not an outer one. What about the case when roll=T but that particular id does not exist in dt1? Playing around a bit more, I can't understand what value is being placed into the column.

Alex asked Aug 19 '12 23:08

1 Answer

That quote from the documentation appears to be from FAQ 1.12, "What is the difference between X[Y] and merge(X,Y)". Did you find the following in ?data.table, and does it help?

roll: Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join.
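To make the roll description concrete, here is a minimal sketch on made-up data (the tables obs and lookup and their values are hypothetical, not taken from the question). It assumes a reasonably recent data.table release and shows the prevailing value being carried forward into gaps and past the last observation, with NA where nothing precedes the lookup time.

```r
library(data.table)

# Hypothetical irregular observations, keyed by (id, t)
obs <- data.table(id  = c(1L, 1L, 2L),
                  t   = c(1L, 5L, 3L),
                  val = c(10, 50, 30),
                  key = c("id", "t"))

# Hypothetical (id, t) points at which we want the prevailing value
lookup <- data.table(id = c(1L, 1L, 1L, 2L, 2L),
                     t  = c(1L, 3L, 9L, 2L, 4L))

# One result row per row of lookup; the last join column (t) is rolled
res <- obs[lookup, roll = TRUE]

# id 1, t = 1: exact match                       -> 10
# id 1, t = 3: falls in the gap after t = 1      -> 10
# id 1, t = 9: after the last observation        -> 50
# id 2, t = 2: before the first observation      -> NA
# id 2, t = 4: after the only observation, t = 3 -> 30
print(res)
```

The t column in the result holds the lookup times, not the times of the observations that were carried forward.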

rolltolast: Like roll but the data is not rolled forward past the last observation within each group defined by the join columns. The value of i must fall in a gap in x but not after the end of the data, for that group defined by all but the last join column. roll and rolltolast may not both be TRUE.
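As far as I know, later data.table releases express the rolltolast behaviour through the rollends argument instead, so the sketch below uses roll = TRUE with rollends = FALSE as an assumed equivalent. The data are hypothetical; the point is only that a lookup after the end of a group's data is not filled.

```r
library(data.table)

# Hypothetical observations, keyed by (id, t)
obs2 <- data.table(id  = c(1L, 1L, 2L, 2L),
                   t   = c(1L, 5L, 3L, 7L),
                   val = c(10, 50, 30, 70),
                   key = c("id", "t"))

lookup2 <- data.table(id = c(1L, 1L, 1L, 2L, 2L),
                      t  = c(1L, 3L, 9L, 2L, 4L))

# Assumption: rollends = FALSE (recycled to c(FALSE, FALSE)) stands in for
# the older rolltolast = TRUE; values still roll inside gaps, but not before
# the first or after the last observation of a group.
res_gap_only <- obs2[lookup2, roll = TRUE, rollends = FALSE]

# id 1, t = 9 is past the last observation (t = 5), so it stays NA here,
# whereas plain roll = TRUE would carry 50 forward.
print(res_gap_only)
```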

In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14, "Can you explain further why data.table is inspired by A[B] syntax in base?". That's quite a long answer, so I won't paste it here.
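One way to see the A[B] reading with the tables from the question: X[Y] looks up rows of X for each row of Y, so the result has one row per row of Y. The sketch below is an illustration under that reading, not a restatement of the FAQ. The probe table is hypothetical, and nomatch = NULL assumes a recent data.table release (older releases spell it nomatch = 0).

```r
library(data.table)

dt1 <- data.table(id = rep(1:5, 10), t = 1:50, val1 = 1:50, key = "id,t")
dt2 <- data.table(id = rep(1:5, 2),  t = 1:10, val2 = 1:10, key = "id,t")

a <- dt1[dt2, roll = TRUE]  # looks up dt1 using dt2: one row per row of dt2
b <- dt2[dt1, roll = TRUE]  # looks up dt2 using dt1: one row per row of dt1

nrow(a)  # 10
nrow(b)  # 50, with val2 carried forward within each id

# Hypothetical lookup row whose id is absent from dt2
probe <- data.table(id = 6L, t = 3L)

miss <- dt2[probe, roll = TRUE]                  # row is kept, val2 is NA
none <- dt2[probe, roll = TRUE, nomatch = NULL]  # unmatched row is dropped
```

With the default nomatch, an id that is missing from x still yields a row, and x's columns are NA for it; rolling only happens within a group that matches on all but the last join column.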

Matt Dowle answered Oct 23 '22 22:10