Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Add column by reference on rolling join

Tags:

r

data.table

The main problem is described in the question title. Going straight to the example below.

I have two datasets:

library(data.table)
dt1 <- data.table(date = as.Date("2015-06-28")+c(0L,3L,5L,7L),
                  key="date")
dt2 <- data.table(date = as.Date("2015-06-30")+c(0:1,4L),
                  val = letters[7:9],
                  dummy = rep(NA,3),
                  key="date")

I want to have val column from dt2 added to dt1 using rolling join.
The following statement will produce similar output to the expected one:

dt2[dt1, roll=TRUE]
#          date val dummy
# 1: 2015-06-28  NA    NA
# 2: 2015-07-01   h    NA
# 3: 2015-07-03   h    NA
# 4: 2015-07-05   i    NA

The are two problems with that statement:
1. I didn't want to have dummy column
2. I want to do this by reference:

address(dt1)
# [1] "0x3b57540"
address(dt2[dt1, roll=TRUE])
# [1] "0x3b4e1f0"

So, I'm looking for rolling join and add column by reference for my dt1 and dt2, expected output:

#          date val
# 1: 2015-06-28  NA
# 2: 2015-07-01   h
# 3: 2015-07-03   h
# 4: 2015-07-05   i

And of course address(dt1) should match to address of the magic statement.

like image 925
jangorecki Avatar asked Sep 28 '22 01:09

jangorecki


2 Answers

Here's a method that should scale.

address(dt1)
# [1] "0x265a060"
ix = dt2[dt1, roll=TRUE, which=TRUE]
dt1[, val := dt2[ix, val]]
dt1
#          date val
# 1: 2015-06-28  NA
# 2: 2015-07-01   h
# 3: 2015-07-03   h
# 4: 2015-07-05   i
address(dt1)
# [1] "0x265a060"

Or without creating the index, directly as:

dt1[, val := dt2[dt1, val, roll = TRUE]] ## (1)

Note that this is more efficient than doing:

dt1[, val := dt2[dt1, roll = TRUE]$val] ## (2)

(2) performs the entire join (materialises all the columns) and then extracts val, where as (1) directly just extracts column val.

like image 134
Clayton Stanley Avatar answered Oct 20 '22 07:10

Clayton Stanley


Here's a workaround; not perfect because it still involves creating extra memory, though there may be a way to avoid this that's not coming to me right away (anyway the extra memory may be minimal):

> address(dt1)
[1] "0x57b5230"

rng<-range(dt1[,range(date)],dt2[,range(date)])

x<-data.table(date=seq(from=rng[1],to=rng[2],by="day"),
              key="date")

> address(x)
[1] "0x6aa2df0"

x[dt2,setdiff(names(dt2),"date"):=mget(setdiff(names(dt2),"date"))
  ][,val:=zoo::na.locf(val,na.rm=F)]

> address(x)
[1] "0x6aa2df0"

> dt1[x,val:=i.val][]
         date val
1: 2015-06-28  NA
2: 2015-07-01   h
3: 2015-07-03   h
4: 2015-07-05   i

> address(dt1)
[1] "0x57b5230"
like image 45
MichaelChirico Avatar answered Oct 20 '22 07:10

MichaelChirico