rolling joins data.table in R

Tags:

I am trying to understand a little more about the way rolling joins work and am having some confusion, I was hoping somebody could clarify this for me. To take a concrete example:

dt1 <- data.table(id=rep(1:5, 10), t=1:50, val1=1:50, key="id,t")
dt2 <- data.table(id=rep(1:5, 2), t=1:10, val2=1:10, key="id,t")

I expected this to produce a long data.table where the values in dt2 are rolled:

dt1[dt2,roll=TRUE]

Instead, the correct way to do this seems to be:

dt2[dt1,roll=TRUE]

Could someone explain to me more about how joining in data.table works as I am clearly not understanding it correctly. I thought that dt1[dt2,roll=TRUE] corresponded to the sql equivalent of select * from dt1 right join dt2 on (dt1.id = dt2.id and dt1.t = dt2.t), except with the added functionality locf.

Additionally the documentation says:

X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) 
as an index.

This makes it seem that only things in X should be returned an the join being done is an inner join, not outer. What about in the case when roll=T but that particular id does not exist in dt1? Playing around a bit more I can't understand what value is being placed into the column.

218

asked Aug 19 '12 23:08

Alex

1 Answers

That quote from the documentation appears to be from FAQ 1.12 What is the difference between X[Y] and merge(X,Y). Did you find the following in ?data.table and does it help?

roll Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join.

rolltolast Like roll but the data is not rolled forward past the last observation within each group defined by the join columns. The value of i must fall in a gap in x but not after the end of the data, for that group defined by all but the last join column. roll and rolltolast may not both be TRUE.

In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14 Can you explain further why data.table is inspired by A[B] syntax in base. That's quite a long answer so I won't paste it here.

198

answered Oct 23 '22 22:10

Matt Dowle

Related questions
                            
                                Extracting text data from PDF files
                            
                                Examples of the perils of globals in R and Stata
                            
                                Pretty ticks for log normal scale using ggplot2 (dynamic not manual)
                            
                                Vectorized IF statement in R?
                            
                                Hiding NA's when printing a dataframe in knitr
                            
                                Creating a sequential list of letters with R
                            
                                Calling R Function from C++
                            
                                Adding a company Logo to ShinyDashboard header
                            
                                How do I read a Parquet in R and convert it to an R DataFrame?
                            
                                Calculate cumsum() while ignoring NA values
                            
                                Random sample of character vector, without elements prefixing one another
                            
                                Create end of the month date from a date variable
                            
                                jupyter-client has to be installed but “jupyter kernelspec --version” exited with code 127
                            
                                dplyr: put count occurrences into new variable [duplicate]
                            
                                rename the columns name after cbind the data
                            
                                How I can select rows from a dataframe that do not match?
                            
                                filtering data.frame based on row_number()
                            
                                cor shows only NA or 1 for correlations - Why?
                            
                                Change the color and font of text in Shiny App
                            
                                Conflict with xlsx package and choose.files

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

rolling joins data.table in R

Tags:

join

r

data.table

time-series

Alex

People also ask

1 Answers

Matt Dowle

Recent Activity

Donate For Us