I've got two data.frames, one with event data and one with stock data for several companies (here it's only two). I want two additional columns with lagged dates (-1 day and +1 day) for both companies in my event data.frame. The lagged dates should come from my stock data.frame (df), of course. How can I do that?
DATE <- c("01.01.2000","02.01.2000","03.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000","01.01.2000","02.01.2000","04.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df <- data.frame(DATE, RET, COMP)
df
# DATE RET COMP
# 1 01.01.2000 -2.00 A
# 2 02.01.2000 1.10 A
# 3 03.01.2000 3.00 A
# 4 06.01.2000 1.40 A
# 5 07.01.2000 -0.20 A
# 6 09.01.2000 0.60 A
# 7 10.01.2000 0.10 A
# 8 01.01.2000 -0.21 B
# 9 02.01.2000 -1.20 B
# 10 04.01.2000 0.90 B
# 11 06.01.2000 0.30 B
# 12 07.01.2000 -0.10 B
# 13 09.01.2000 0.30 B
# 14 10.01.2000 -0.12 B
DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")
event <- data.frame(DATE, ARTICLE, COMP)
event
# DATE ARTICLE COMP
# 1 02.01.2000 blabla11 A
# 2 03.01.2000 blabla12 A
# 3 06.01.2000 blabla13 A
# 4 09.01.2000 blabla14 A
# 5 06.01.2000 blabla21 B
# 6 07.01.2000 blabla22 B
# 7 09.01.2000 blabla23 B
The output should be my data.frame event with the two additional columns DATEm1 and DATEp1:
# DATE DATEm1 DATEp1 ARTICLE COMP
# 1 02.01.2000 01.01.2000 03.01.2000 blabla11 A
# 2 03.01.2000 02.01.2000 06.01.2000 blabla12 A
# 3 06.01.2000 03.01.2000 07.01.2000 blabla13 A
# 4 09.01.2000 07.01.2000 10.01.2000 blabla14 A
# 5 06.01.2000 04.01.2000 07.01.2000 blabla21 B
# 6 07.01.2000 06.01.2000 09.01.2000 blabla22 B
# 7 09.01.2000 07.01.2000 10.01.2000 blabla23 B
I have tried the approach in G. Grothendieck's answer, which works perfectly for this example.
The problem is that my original data.frame contains far more data than this example, and the sqldf approach is rather slow and uses a lot of memory (too much for my machine). Does anyone have another solution?
I tried an approach that uses embed and data.table. Testing with the provided example data, it is competitive with the other data.table approaches (see benchmarking below), but still a bit slower. The embed approach might be faster when extended to additional lags, but I'm not sure if that's relevant (a short illustration of embed follows below).
Anyway, I put the answers (as of right now) together and compared the timings and output. I don't know how much the exact output matters to you (e.g., I lost a bit of time on the benchmarking because I had to dump the RET column), but note that the answers vary slightly in output format/content. All approaches provide a result similar to your desired output format.
I wonder if the different methods scale differently for different sizes of data.frames... If you test these, I'd be curious to know which is fastest for you and your data! :)
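To make the embed trick concrete, here is a minimal sketch of what embed with dimension 3 returns, on a toy character vector rather than the example data:
embed(c("d1", "d2", "d3", "d4"), 3)
#      [,1] [,2] [,3]
# [1,] "d3" "d2" "d1"
# [2,] "d4" "d3" "d2"
Each row pairs a value with its two predecessors; the helper e2() below pads the top with d-1 rows of NA so the output lines up with the original vector, and m2l() splits the columns into a list suitable for data.table's :=.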
library("data.table")
library("sqldf")
library("microbenchmark")
# ========
# = Data =
# ========
DATE <- c("01.01.2000", "02.01.2000", "03.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000", "01.01.2000", "02.01.2000", "04.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df0 <- data.frame(DATE, RET, COMP)
DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")
event0 <- data.frame(DATE, ARTICLE, COMP)
# ==================
# = rbatt function =
# ==================
# Deviations from desired format:
# 1) column order (COMP is first instead of last, otherwise correct order)
m2l <- function(x) split(x, rep(1:ncol(x), each = nrow(x))) # matrix -> list of its columns; thanks to https://stackoverflow.com/a/6823557/2343633
e2 <- function(x, d=1) m2l(rbind(matrix(NA, ncol=d, nrow=d-1), embed(x,d))) # embed() padded with d-1 NA rows so each column has length(x) elements
testRB <- function(df=df0, event=event0){
  dt1 <- as.data.table(df)
  dt1[, DATE := as.character(DATE)]
  # each row now carries a date (DATE), the next date (DATEp1), and the previous date (DATEm1), per company
  dt1[, c("DATEp1", "DATE", "DATEm1") := e2(DATE, 3), by = COMP]
  dt1[, RET := NULL]
  setkey(dt1, COMP, DATE, DATEp1, DATEm1)
  dt2 <- as.data.table(event)
  dt2[, DATE := as.character(DATE)]
  setkey(dt2, COMP, DATE)
  # below is slightly slower than doing dt1[,RET:=NULL] then dt <- dt1[dt2]
  # dt <- dt1[dt2, list(DATEp1, DATEm1, ARTICLE)] # join
  dt <- dt1[dt2]
  dt
}
rbatt output:
# COMP DATE DATEp1 DATEm1 ARTICLE
#1: A 02.01.2000 03.01.2000 01.01.2000 blabla11
#2: A 03.01.2000 06.01.2000 02.01.2000 blabla12
#3: A 06.01.2000 07.01.2000 03.01.2000 blabla13
#4: A 09.01.2000 10.01.2000 07.01.2000 blabla14
#5: B 06.01.2000 07.01.2000 04.01.2000 blabla21
#6: B 07.01.2000 09.01.2000 06.01.2000 blabla22
#7: B 09.01.2000 10.01.2000 07.01.2000 blabla23
edited – DA optimization #1 (old code commented-out)
edited – DA optimization #2 (old code commented, versions labeled)
# ===========================
# = David Arenburg function =
# ===========================
# https://stackoverflow.com/a/23483865/2343633
# Deviations from desired format:
# 1) column order
# 2) format of DATE, DATEm1, DATEp1
testDA <- function(df=df0, event=event0){
  # Original DA version:
  # df$DATE <- as.Date(strptime(as.character(df$DATE), format = "%m.%d.%Y"))
  # event$DATE <- as.Date(strptime(as.character(event$DATE), format = "%m.%d.%Y"))
  #
  # ## Making sure "df" is sorted. If your data sets are already ordered you can
  # ## skip the ordering both here and in the `setDT`
  # df <- df[order(df$COMP, df$DATE), ]
  #
  # library(data.table)
  # DT <- setDT(event)[order(COMP, DATE), list(
  #   DATEm1 = df[match(DATE, df$DATE) - 1, "DATE"],
  #   DATEp1 = df[match(DATE, df$DATE) + 1, "DATE"]
  # ), by = c("ARTICLE", "DATE", "COMP")]
  # DT

  # Optimization #1:
  # event$DATE <- as.character(event$DATE) # converting event$DATE to character (skip if it already is)
  # tempdf <- as.character(data.table(df, key = c("COMP", "DATE"))$DATE) # sorting and converting df$DATE to character too so they will match
  # setDT(event)[order(COMP, DATE), `:=` (
  #   DATEm1 = tempdf[match(DATE, tempdf) - 1],
  #   DATEp1 = tempdf[match(DATE, tempdf) + 1]
  # ), by = c("DATE", "COMP")]
  # event

  # Optimization #2:
  # library(data.table) # loading the data.table package
  tempdf <- data.table(df, key = c("COMP", "DATE"))$DATE # sorting df and keeping only the dates for speed
  setDT(event)[order(COMP, DATE), `:=` (
    DATEm1 = tempdf[match(DATE, tempdf) - 1],
    DATEp1 = tempdf[match(DATE, tempdf) + 1]
  )]
  event
}
David Arenburg output:
edited for DA optimization #1 (#2 may be bugged)
note the wrong content in row 5, column "DATEm1": the month should be 04
# > testDA()
# DATE ARTICLE COMP DATEm1 DATEp1
# 1: 02.01.2000 blabla11 A 01.01.2000 03.01.2000
# 2: 03.01.2000 blabla12 A 02.01.2000 06.01.2000
# 3: 06.01.2000 blabla13 A 03.01.2000 07.01.2000
# 4: 09.01.2000 blabla14 A 07.01.2000 10.01.2000
# 5: 06.01.2000 blabla21 B 03.01.2000 07.01.2000
# 6: 07.01.2000 blabla22 B 06.01.2000 09.01.2000
# 7: 09.01.2000 blabla23 B 07.01.2000 10.01.2000
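A possible explanation for the wrong row-5 value (my reading of the code, not something DA confirmed): match() searches the whole pooled, company-sorted date vector, so a date that occurs for both companies always resolves into company A's block:
tempdf <- data.table(df0, key = c("COMP", "DATE"))$DATE # A's dates first, then B's
match("06.01.2000", tempdf)             # 4 -- first occurrence, inside company A's block
tempdf[match("06.01.2000", tempdf) - 1] # 03.01.2000 -- company A's previous date,
                                        # which ends up in company B's row 5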
# ============================
# = G. Grothendieck function =
# ============================
# https://stackoverflow.com/a/23415033/2343633
# Deviations from desired format:
# 1) format of DATE, DATEm1, DATEp1
testGG <- function(df=df0, event=event0){
  # ensure that dates sort correctly by converting to yyyy-mm-dd
  df2 <- transform(df, DATE = format(as.Date(DATE, "%m.%d.%Y")))
  event2 <- transform(event, DATE = format(as.Date(DATE, "%m.%d.%Y")))
  result <- sqldf(c("create index i on df2(COMP, DATE)",
    "select
       event.DATE,
       max(A.DATE) DATEm1,
       min(B.DATE) DATEp1,
       event.ARTICLE,
       event.COMP
     from event2 event, main.df2 A, main.df2 B
     on event.COMP = A.COMP and event.COMP = B.COMP
       and event.DATE > A.DATE and event.DATE < B.DATE
     group by event.DATE, event.COMP
     order by event.COMP, event.DATE"))
  result
}
GG output:
# DATE DATEm1 DATEp1 ARTICLE COMP
# 1 2000-02-01 2000-01-01 2000-03-01 blabla11 A
# 2 2000-03-01 2000-02-01 2000-06-01 blabla12 A
# 3 2000-06-01 2000-03-01 2000-07-01 blabla13 A
# 4 2000-09-01 2000-07-01 2000-10-01 blabla14 A
# 5 2000-06-01 2000-04-01 2000-07-01 blabla21 B
# 6 2000-07-01 2000-06-01 2000-09-01 blabla22 B
# 7 2000-09-01 2000-07-01 2000-10-01 blabla23 B
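If you need the original mm.dd.yyyy strings back, one possible post-processing step (a sketch, not part of GG's answer):
# convert the ISO dates in the result back to the original mm.dd.yyyy strings
result[c("DATE", "DATEm1", "DATEp1")] <- lapply(
  result[c("DATE", "DATEm1", "DATEp1")],
  function(x) format(as.Date(x), "%m.%d.%Y"))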
# =================
# = Arun function =
# =================
# https://stackoverflow.com/a/23484292/2343633
# Deviations from desired format:
# 1) Column order (COMP first, ARTICLE does not come after DATEm1)
testAR <- function(df=df0, event=event0){
  dt1 = as.data.table(df)
  dt2 = as.data.table(event)
  key_cols = c("COMP", "DATE")
  setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
  setkeyv(dt1, key_cols)
  idx1 = dt1[dt2, which=TRUE, mult="first"] - 1L # row before each event's first match in dt1
  idx2 = dt1[dt2, which=TRUE, mult="last"] + 1L  # row after each event's last match in dt1
  idx1[idx1 == 0L] = NA # a match in dt1's first row has no previous date
  dt2[, `:=`(DATEm1 = dt1$DATE[idx1],
             DATEp1 = dt1$DATE[idx2]
  )]
  dt2
}
Arun output:
# COMP DATE ARTICLE DATEm1 DATEp1
# 1: A 02.01.2000 blabla11 01.01.2000 03.01.2000
# 2: A 03.01.2000 blabla12 02.01.2000 06.01.2000
# 3: A 06.01.2000 blabla13 03.01.2000 07.01.2000
# 4: A 09.01.2000 blabla14 07.01.2000 10.01.2000
# 5: B 06.01.2000 blabla21 04.01.2000 07.01.2000
# 6: B 07.01.2000 blabla22 06.01.2000 09.01.2000
# 7: B 09.01.2000 blabla23 07.01.2000 10.01.2000
edit – note that this is the original benchmark (original code, original OP data set)
# =============
# = Benchmark =
# =============
microbenchmark(testAR(), testDA(), testRB(), testGG())
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.220278 3.414430 3.509251 3.626438 7.209494 100
# testDA() 4.273542 4.471227 4.569370 4.752857 6.460922 100
# testRB() 5.704559 5.981680 6.135946 6.457392 14.309858 100
# testGG() 22.337065 23.064494 23.964581 24.622467 50.934712 100
Note that I dropped testGG() from this benchmark because it was far slower (I did some tests on a couple of intermediate data sets, and testGG() scaled worse than the other three approaches).
# ========
# = Data =
# ========
mos <- c("01","02","03","06","07","09","10", "01", "02", "04", "06", "07", "09", "10")
yrs <- 1920:2020
DATE <- paste(mos, "01", rep(yrs, each=length(mos)), sep=".")
RET <- rep(c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12), length(yrs))
COMP <- rep(c("A","A","A","A","A","A","A","B","B","B","B","B","B","B"), length(yrs))
df0 <- data.frame(DATE, RET, COMP)
mos2 <- c("02","03","06","09","06","07","09")
DATE <- paste(mos2, "01", rep(yrs, each=length(mos2)), sep=".")
ARTICLE <- rep(c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23"), length(yrs))
COMP <- rep(c("A","A","A","A","B","B","B"), length(yrs))
event0 <- data.frame(DATE, ARTICLE, COMP)
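For scale, that construction yields 14 * 101 = 1414 stock rows and 7 * 101 = 707 event rows:
nrow(df0)    # 1414
nrow(event0) # 707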
edit – original benchmarks for large dataset:
# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.458217 3.696698 3.934349 4.697033 6.584214 100
# testDA() 143.180409 148.916461 151.776002 155.219515 237.524369 100
# testRB() 7.279168 7.636102 8.073778 8.828537 11.143111 100
edit – benchmark for large dataset after DA optimization #1:
# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.198266 3.440739 3.605723 3.788199 22.52867 100
# testDA() 56.290346 59.528819 60.821921 64.580825 80.99480 100
# testRB() 6.763570 7.200741 7.400343 7.748849 20.97527 100
edit – benchmark for large data set after DA optimization #2:
NOTE – warning resulting from update #2 to testDA()
# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.423508 6.055584 6.246517 6.333444 7.653360 100
# testDA() 2.665558 3.961070 4.062354 4.139571 8.427439 100
# testRB() 6.421328 6.669137 6.877517 6.966977 8.271469 100
# There were 50 or more warnings (use warnings() to see the first 50)
# > warnings()[1]
# Warning message:
# In `[.data.table`(dt2, , `:=`(DATEm1 = dt1$DATE[idx1], ... :
# Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.
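A plausible cause of the warning (my guess, not verified): setDT(event) inside testDA() converts the shared event0 to a data.table by reference, so later benchmark iterations start from an already-modified object. A defensive copy of the input should sidestep that; a hypothetical variant:
# hypothetical variant of testDA: copy the input so setDT() never touches the global event0
testDA2 <- function(df = df0, event = data.table::copy(event0)){
  tempdf <- data.table(df, key = c("COMP", "DATE"))$DATE
  setDT(event)[order(COMP, DATE), `:=` (
    DATEm1 = tempdf[match(DATE, tempdf) - 1],
    DATEp1 = tempdf[match(DATE, tempdf) + 1]
  )]
  event
}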
Profiling code
Rprof("testAR.out", memory.profiling=TRUE)
for(i in 1:50){
arAns <- testAR()
}
Rprof(NULL)
Rprof("testDA.out", memory.profiling=TRUE)
for(i in 1:50){
daAns <- testDA()
}
Rprof(NULL)
Rprof("testRB.out", memory.profiling=TRUE)
for(i in 1:50){
rbAns <- testRB()
}
Rprof(NULL)
testAR() profile results
# > summaryRprof("testAR.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "[[" 0.02 10 0.06 30 8.3
# "head" 0.02 10 0.04 20 12.1
# "nrow" 0.02 10 0.04 20 10.6
# ".Call" 0.02 10 0.02 10 8.2
# ".row_names_info" 0.02 10 0.02 10 8.4
# "<Anonymous>" 0.02 10 0.02 10 8.3
# "key" 0.02 10 0.02 10 0.0
# "levels.default" 0.02 10 0.02 10 0.0
# "match" 0.02 10 0.02 10 0.0
# "stopifnot" 0.02 10 0.02 10 4.2
testDA() profile results
# > summaryRprof("testDA.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "match" 2.04 26.56 2.34 30.47 94.2
# "[.data.frame" 1.78 23.18 6.50 84.64 295.3
# "NextMethod" 0.80 10.42 0.80 10.42 33.9
# "strptime" 0.42 5.47 0.46 5.99 25.9
# "[" 0.34 4.43 7.14 92.97 335.9
# "[.Date" 0.34 4.43 1.14 14.84 49.8
# "names" 0.34 4.43 0.34 4.43 17.9
# "%in%" 0.28 3.65 1.44 18.75 50.3
# "dim" 0.28 3.65 0.30 3.91 13.9
# "order" 0.16 2.08 0.18 2.34 1.7
# "$" 0.16 2.08 0.16 2.08 7.0
# ".Call" 0.14 1.82 6.76 88.02 308.4
# "length" 0.14 1.82 0.14 1.82 4.6
# "sys.call" 0.14 1.82 0.14 1.82 5.6
# "<Anonymous>" 0.04 0.52 0.04 0.52 9.5
# "as.Date.POSIXlt" 0.04 0.52 0.04 0.52 3.4
# "getwd" 0.04 0.52 0.04 0.52 9.5
# "do.call" 0.02 0.26 0.18 2.34 1.7
# "assign" 0.02 0.26 0.04 0.52 0.1
# ".subset2" 0.02 0.26 0.02 0.26 6.1
# "all" 0.02 0.26 0.02 0.26 0.0
# "file.info" 0.02 0.26 0.02 0.26 0.0
# "is.data.table" 0.02 0.26 0.02 0.26 0.0
# "lockBinding" 0.02 0.26 0.02 0.26 0.1
# "parent.frame" 0.02 0.26 0.02 0.26 0.0
# "pmatch" 0.02 0.26 0.02 0.26 0.0
# "which" 0.02 0.26 0.02 0.26 6.5
testRB() profile results
# > summaryRprof("testRB.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "sort.list" 0.04 9.52 0.06 14.29 21.5
# "length" 0.04 9.52 0.04 9.52 0.0
# "pmatch" 0.04 9.52 0.04 9.52 13.9
# "[.data.table" 0.02 4.76 0.42 100.00 71.8
# ".Call" 0.02 4.76 0.12 28.57 39.6
# "split.default" 0.02 4.76 0.10 23.81 32.9
# "alloc.col" 0.02 4.76 0.08 19.05 13.3
# "[[" 0.02 4.76 0.04 9.52 6.9
# "cedta" 0.02 4.76 0.04 9.52 0.0
# "lapply" 0.02 4.76 0.04 9.52 0.0
# "[[.data.frame" 0.02 4.76 0.02 4.76 6.9
# "as.character" 0.02 4.76 0.02 4.76 6.0
# "as.name" 0.02 4.76 0.02 4.76 5.3
# "attr" 0.02 4.76 0.02 4.76 0.0
# "exists" 0.02 4.76 0.02 4.76 0.0
# "FUN" 0.02 4.76 0.02 4.76 0.0
# "intersect" 0.02 4.76 0.02 4.76 6.5
# "is.data.table" 0.02 4.76 0.02 4.76 0.0
As far as I can tell, Arun's answer is the fastest and most memory-efficient. rbatt's answer scales better with data set size than DA's did before optimization #2 – my initial guess was that approaches using POSIX or Date classes might not scale well, but I'm not sure that hunch is supported by the profiling results. If someone thinks it would be helpful, I could provide the full profile results instead of just the $by.self portion.
Also worth noting: time spent and memory used were positively correlated among approaches – the fastest approaches also used the least memory.
Here's another approach using data.table:
First, we convert df and event to data.tables. Here I'll use as.data.table(.), but you can use setDT if you don't want to make a copy. That is, by doing setDT(df), df will be modified by reference into a data.table.
require(data.table) ## >= 1.9.2
dt1 = as.data.table(df)
dt2 = as.data.table(event)
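A quick sketch of the copy-versus-reference difference, on toy data:
DF <- data.frame(x = 1:3)
DT <- as.data.table(DF) # DT is a copy; DF is still a plain data.frame
setDT(DF)               # converts DF itself by reference, no copy made
class(DF)               # "data.table" "data.frame"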
Then we'll prepare the data as follows:
key_cols = c("COMP", "DATE")
setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
setkeyv(dt1, key_cols)
setcolorder rearranges the columns of your data.tables by reference. setkeyv sorts the data.table by the given columns in ascending order and marks them as the key columns of dt1.
The column reordering is essential here because we don't set a key on dt2 (that would sort dt2, which may be undesirable for you). And since no key is set on dt2, data.table takes the first 'n' (= 2 here) columns from dt2 to match against the key columns of dt1.
Note: a join x[i] using data.table absolutely requires the key of x to be set. Here x = dt1. Setting a key on i is optional, depending on whether you wish the order to be preserved or not.
Now, we perform two joins and get the corresponding matching indices:
idx1 = dt1[dt2, which=TRUE, mult="first"]-1L
idx2 = dt1[dt2, which=TRUE, mult="last"]+1L
The first join gets, for each match of dt2 in dt1, the first matching position in dt1. Similarly, the second join gets, for each match of dt2 in dt1, the last matching position in dt1. We add -1 and +1 to get the previous and next indices, respectively.
Take care of one special case:
idx1[idx1 == 0L] = NA
When the matching index is 1, subtracting 1 results in 0, and indexing a vector with 0 returns a zero-length result rather than NA, so we have to replace it with NA explicitly. (The other end needs no such fix: an index past the last row already returns NA.)
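A minimal illustration of why the 0-index, but not the out-of-range index, needs fixing (toy vector, not the example data):
x <- c("a", "b", "c")
x[0] # character(0) -- zero-length, would silently drop an element
x[4] # NA -- indexing past the end already gives NA, so idx2 is fine as is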
Now we can just subset those dates and add them to dt2 by reference, as follows:
dt2[, `:=`(DATEm1 = dt1$DATE[idx1],
DATEp1 = dt1$DATE[idx2]
)]
# COMP DATE ARTICLE DATEm1 DATEp1
# 1: A 02.01.2000 blabla11 01.01.2000 03.01.2000
# 2: A 03.01.2000 blabla12 02.01.2000 06.01.2000
# 3: A 06.01.2000 blabla13 03.01.2000 07.01.2000
# 4: A 09.01.2000 blabla14 07.01.2000 10.01.2000
# 5: B 06.01.2000 blabla21 04.01.2000 07.01.2000
# 6: B 07.01.2000 blabla22 06.01.2000 09.01.2000
# 7: B 09.01.2000 blabla23 07.01.2000 10.01.2000