Create lagged vectors based on a different data.frame in a panel in R

Tags: merge, r, lag
I've got two data.frames: one with event data and one with stock data for several companies (here only two). I want two additional columns in my event data.frame with lagged dates (-1 day and +1 day, i.e. the previous and next available trading date) for both companies. The lagged dates should of course come from my stock data.frame (df). How can I do that?

DATE <- c("01.01.2000","02.01.2000","03.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000","01.01.2000","02.01.2000","04.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df <- data.frame(DATE, RET, COMP)

df

# DATE   RET COMP
# 1  01.01.2000 -2.00    A
# 2  02.01.2000  1.10    A
# 3  03.01.2000  3.00    A
# 4  06.01.2000  1.40    A
# 5  07.01.2000 -0.20    A
# 6  09.01.2000  0.60    A
# 7  10.01.2000  0.10    A
# 8  01.01.2000 -0.21    B
# 9  02.01.2000 -1.20    B
# 10 04.01.2000  0.90    B
# 11 06.01.2000  0.30    B
# 12 07.01.2000 -0.10    B
# 13 09.01.2000  0.30    B
# 14 10.01.2000 -0.12    B

DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")

event <- data.frame(DATE, ARTICLE, COMP)

event

#         DATE  ARTICLE COMP
# 1 02.01.2000 blabla11    A
# 2 03.01.2000 blabla12    A
# 3 06.01.2000 blabla13    A
# 4 09.01.2000 blabla14    A
# 5 06.01.2000 blabla21    B
# 6 07.01.2000 blabla22    B
# 7 09.01.2000 blabla23    B

The output should be my data.frame event with the two additional columns DATEm1 and DATEp1:

#         DATE      DATEm1      DATEp1   ARTICLE COMP
# 1 02.01.2000  01.01.2000  03.01.2000  blabla11    A
# 2 03.01.2000  02.01.2000  06.01.2000  blabla12    A
# 3 06.01.2000  03.01.2000  07.01.2000  blabla13    A
# 4 09.01.2000  07.01.2000  10.01.2000  blabla14    A
# 5 06.01.2000  04.01.2000  07.01.2000  blabla21    B
# 6 07.01.2000  06.01.2000  09.01.2000  blabla22    B
# 7 09.01.2000  07.01.2000  10.01.2000  blabla23    B

I have tried the approach in G. Grothendieck's answer, which works perfectly for this example.

The problem is that my original data.frame contains far more data than this example, and the sqldf approach is rather slow and uses a lot of memory (too much for my machine). Does anyone have another solution for this?

asked May 01 '14 by cptn

2 Answers

I tried an approach that uses embed and data.table. Tested with the provided example data, it is competitive with the other data.table approaches (see the benchmarking below), but still a bit slower. The embed approach might be faster when extended to additional lags, but I'm not sure whether that's relevant here.

Anyway, I put together the answers posted so far and compared their timings and output. I don't know how much the exact output format matters to you (e.g., I lost a bit of time on the benchmarking because I had to drop the RET column), but note that the answers vary slightly in output format and content. All approaches produce a result similar to your desired output.

I suspect the different methods scale differently with data.frame size. If you test these, I'd be curious to know which is fastest for you and your data! :)

Data and Libraries

library("data.table")
library("sqldf")
library("microbenchmark")

# ========
# = Data =
# ========
DATE <- c("01.01.2000", "02.01.2000", "03.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000", "01.01.2000", "02.01.2000", "04.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df0 <- data.frame(DATE, RET, COMP)

DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")
event0 <- data.frame(DATE, ARTICLE, COMP)

rbatt (this answer)

# ==================
# = rbatt function =
# ==================
# Deviations from desired format:
#  1) column order (COMP is first instead of last; otherwise the order is correct)
m2l <- function(x) split(x, rep(1:ncol(x), each = nrow(x))) # matrix to list of columns; thanks to https://stackoverflow.com/a/6823557/2343633
e2 <- function(x, d=1) m2l(rbind(matrix(NA, ncol=d, nrow=d-1), embed(x,d))) # NA-pad embed() so each shifted column keeps length(x)
testRB <- function(df=df0, event=event0){
    dt1 <- as.data.table(df)
    dt1[,DATE:=as.character(DATE)]
    dt1[,c("DATEp1","DATE","DATEm1"):=e2(DATE,3),by=COMP]
    dt1[,RET:=NULL]
    setkey(dt1, COMP, DATE, DATEp1, DATEm1)

    dt2 <- as.data.table(event)
    dt2[,DATE:=as.character(DATE)]
    setkey(dt2,COMP,DATE)

    # below is slightly slower than doing dt1[,RET:=NULL] then  dt <- dt1[dt2]
    # dt <- dt1[dt2, list(DATEp1, DATEm1, ARTICLE)] # join 

    dt <- dt1[dt2]
    dt
}

rbatt output:

#   COMP       DATE     DATEp1     DATEm1  ARTICLE
#1:    A 02.01.2000 03.01.2000 01.01.2000 blabla11
#2:    A 03.01.2000 06.01.2000 02.01.2000 blabla12
#3:    A 06.01.2000 07.01.2000 03.01.2000 blabla13
#4:    A 09.01.2000 10.01.2000 07.01.2000 blabla14
#5:    B 06.01.2000 07.01.2000 04.01.2000 blabla21
#6:    B 07.01.2000 09.01.2000 06.01.2000 blabla22
#7:    B 09.01.2000 10.01.2000 07.01.2000 blabla23
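
For intuition, here is roughly what embed() produces before e2() pads it with NA rows and splits it into the three date columns. A minimal sketch with a numeric vector (the real code applies this to the character DATE column within each COMP):

x <- 1:5
embed(x, 3)
#      [,1] [,2] [,3]
# [1,]    3    2    1
# [2,]    4    3    2
# [3,]    5    4    3
# e2(x, 3) stacks two NA rows on top, so row i of the padded matrix holds
# (x[i], x[i-1], x[i-2]): the next, current, and previous values once the
# columns are assigned to c("DATEp1", "DATE", "DATEm1").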

DA Answer

edited – DA optimization #1 (old code commented-out)

edited – DA optimization #2 (old code commented, versions labeled)

# ===========================
# = David Arenburg function =
# ===========================
# https://stackoverflow.com/a/23483865/2343633
# Deviations from desired format:
#  1) column order
#  2) format of DATE, DATEm1, DATEp1

testDA <- function(df=df0, event=event0){
    # Original DA below:
    # df$DATE <- as.Date(strptime(as.character(df$DATE), format = "%m.%d.%Y"))
    # event$DATE <- as.Date(strptime(as.character(event$DATE), format = "%m.%d.%Y"))
    # 
    # ## Making sure "df" is sorted. If your data sets are already ordered you can skip the ordering both here and in the `setDT`
    # df <- df[order(df$COMP, df$DATE), ]
    # 
    # library(data.table)
    # DT <- setDT(event)[order(COMP, DATE), list(
    #                     DATEm1 = df[match(DATE, df$DATE) - 1, "DATE"], 
    #                     DATEp1 = df[match(DATE, df$DATE) + 1, "DATE"]
    #                     ), by = c("ARTICLE", "DATE", "COMP")]
    # DT

    # Optimization #1:
    # event$DATE <- as.character(event$DATE) # converting event$DATE to character (if it is already a character, better to skip this part)
    # tempdf <- as.character(data.table(df, key = c("COMP", "DATE"))$DATE) # sorting and converting df$DATE to character too so they will match
    # setDT(event)[order(COMP, DATE), `:=` (
    #   DATEm1 = tempdf[match(DATE, tempdf) - 1], 
    #   DATEp1 = tempdf[match(DATE, tempdf) + 1]
    # ), by = c("DATE", "COMP")]
    # event

    # Optimization #2
    # library(data.table) # loading data.table pckg
    tempdf <- data.table(df, key = c("COMP", "DATE"))$DATE # sorting df and taking only the dates for speed
    setDT(event)[order(COMP, DATE), `:=` (
      DATEm1 = tempdf[match(DATE, tempdf) - 1], 
      DATEp1 = tempdf[match(DATE, tempdf) + 1]
    )]
    event
}

David Arenburg output:

edited for DA optimization #1 (#2 may be bugged)

note the wrong value in row 5, column "DATEm1": the month should be 04

# > testDA()
#          DATE  ARTICLE COMP     DATEm1     DATEp1
# 1: 02.01.2000 blabla11    A 01.01.2000 03.01.2000
# 2: 03.01.2000 blabla12    A 02.01.2000 06.01.2000
# 3: 06.01.2000 blabla13    A 03.01.2000 07.01.2000
# 4: 09.01.2000 blabla14    A 07.01.2000 10.01.2000
# 5: 06.01.2000 blabla21    B 03.01.2000 07.01.2000
# 6: 07.01.2000 blabla22    B 06.01.2000 09.01.2000
# 7: 09.01.2000 blabla23    B 07.01.2000 10.01.2000
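
My guess at the cause (an assumption on my part, not something confirmed by DA): tempdf pools both companies' dates, sorted by COMP and then DATE, and match() always returns the first occurrence. So a B date that also occurs for company A matches A's copy and picks up A's neighboring date:

# Hypothetical illustration of the suspected bug in the optimized lookup:
tempdf <- c("01.01.2000","02.01.2000","03.01.2000","06.01.2000",  # company A's dates first...
            "01.01.2000","02.01.2000","04.01.2000","06.01.2000")  # ...then company B's
match("06.01.2000", tempdf)                # 4: A's 06.01.2000, not B's
tempdf[match("06.01.2000", tempdf) - 1]    # "03.01.2000": A's previous date, hence the wrong DATEm1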

GG Answer

# ============================
# = G. Grothendieck function =
# ============================
# https://stackoverflow.com/a/23415033/2343633
# Deviations from desired format:
#  1) format of DATE, DATEm1, DATEp1
testGG <- function(df=df0, event=event0){
    # ensure that dates sort correctly by converting to yyyy-mm-dd
    df2 <- transform(df, DATE = format(as.Date(DATE, "%m.%d.%Y")))
    event2 <- transform(event, DATE = format(as.Date(DATE, "%m.%d.%Y")))

    result <- sqldf(c("create index i on df2(COMP, DATE)",
          "select 
              event.DATE, 
              max(A.DATE) DATEm1, 
              min(B.DATE) DATEp1, 
              event.ARTICLE, 
              event.COMP
           from event2 event, main.df2 A, main.df2 B 
           on event.COMP = A.COMP and event.COMP = B.COMP
              and event.DATE > A.DATE and event.DATE < B.DATE
           group by event.DATE, event.COMP
           order by event.COMP, event.DATE"))
    result
}

GG output:

#         DATE     DATEm1     DATEp1  ARTICLE COMP
# 1 2000-02-01 2000-01-01 2000-03-01 blabla11    A
# 2 2000-03-01 2000-02-01 2000-06-01 blabla12    A
# 3 2000-06-01 2000-03-01 2000-07-01 blabla13    A
# 4 2000-09-01 2000-07-01 2000-10-01 blabla14    A
# 5 2000-06-01 2000-04-01 2000-07-01 blabla21    B
# 6 2000-07-01 2000-06-01 2000-09-01 blabla22    B
# 7 2000-09-01 2000-07-01 2000-10-01 blabla23    B
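
If the yyyy-mm-dd output format is an issue, the date columns can be mapped back to the original mm.dd.yyyy format after the query. A small sketch, assuming the result from testGG() above:

result <- testGG()
date_cols <- c("DATE", "DATEm1", "DATEp1")
# format back from yyyy-mm-dd to the original mm.dd.yyyy
result[date_cols] <- lapply(result[date_cols],
                            function(d) format(as.Date(d), "%m.%d.%Y"))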

Arun Answer

# =================
# = Arun function =
# =================
# https://stackoverflow.com/a/23484292/2343633
# Deviations from desired format:
#  1) Column order (COMP first, ARTICLE does not come after DATEm1)
testAR <- function(df=df0, event=event0){
    dt1 = as.data.table(df)
    dt2 = as.data.table(event)

    key_cols = c("COMP", "DATE")
    setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
    setkeyv(dt1, key_cols)

    idx1 = dt1[dt2, which=TRUE, mult="first"]-1L
    idx2 = dt1[dt2, which=TRUE, mult="last"]+1L

    idx1[idx1 == 0L] = NA

    dt2[, `:=`(DATEm1 = dt1$DATE[idx1], 
               DATEp1 = dt1$DATE[idx2]
      )]

    dt2
}

Arun output:

#    COMP       DATE  ARTICLE     DATEm1     DATEp1
# 1:    A 02.01.2000 blabla11 01.01.2000 03.01.2000
# 2:    A 03.01.2000 blabla12 02.01.2000 06.01.2000
# 3:    A 06.01.2000 blabla13 03.01.2000 07.01.2000
# 4:    A 09.01.2000 blabla14 07.01.2000 10.01.2000
# 5:    B 06.01.2000 blabla21 04.01.2000 07.01.2000
# 6:    B 07.01.2000 blabla22 06.01.2000 09.01.2000
# 7:    B 09.01.2000 blabla23 07.01.2000 10.01.2000

Benchmark

edit – note that this is the original benchmark (original code, original OP data set)

# =============
# = Benchmark =
# =============
microbenchmark(testAR(), testDA(), testRB(), testGG())

# Unit: milliseconds
#      expr       min        lq    median        uq       max neval
#  testAR()  3.220278  3.414430  3.509251  3.626438  7.209494   100
#  testDA()  4.273542  4.471227  4.569370  4.752857  6.460922   100
#  testRB()  5.704559  5.981680  6.135946  6.457392 14.309858   100
#  testGG() 22.337065 23.064494 23.964581 24.622467 50.934712   100

EDIT: Benchmark with larger data set

Note that I dropped testGG() from this benchmark because it was far slower (I ran some tests on a couple of intermediate data sets, and testGG() scaled worse than the other three approaches).

# ========
# = Data =
# ========
mos <- c("01","02","03","06","07","09","10", "01", "02", "04", "06", "07", "09", "10")
yrs <- 1920:2020
DATE <- paste(mos, "01", rep(yrs, each=length(mos)), sep=".")
RET <- rep(c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12), length(yrs))
COMP <- rep(c("A","A","A","A","A","A","A","B","B","B","B","B","B","B"), length(yrs))
df0 <- data.frame(DATE, RET, COMP)

mos2 <- c("02","03","06","09","06","07","09")
DATE <- paste(mos2, "01", rep(yrs, each=length(mos2)), sep=".")
ARTICLE <- rep(c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23"), length(yrs))
COMP <- rep(c("A","A","A","A","B","B","B"), length(yrs))
event0 <- data.frame(DATE, ARTICLE, COMP)

edit – original benchmarks for large dataset:

# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
#      expr        min         lq     median         uq        max neval
#  testAR()   3.458217   3.696698   3.934349   4.697033   6.584214   100
#  testDA() 143.180409 148.916461 151.776002 155.219515 237.524369   100
#  testRB()   7.279168   7.636102   8.073778   8.828537  11.143111   100

edit – benchmark for large dataset after DA optimization #1:

# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
#      expr       min        lq    median        uq      max neval
#  testAR()  3.198266  3.440739  3.605723  3.788199 22.52867   100
#  testDA() 56.290346 59.528819 60.821921 64.580825 80.99480   100
#  testRB()  6.763570  7.200741  7.400343  7.748849 20.97527   100

edit – benchmark for large data set after DA optimization #2:

NOTE – warning resulting from update #2 to testDA()

# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
#      expr      min       lq   median       uq      max neval
#  testAR() 3.423508 6.055584 6.246517 6.333444 7.653360   100
#  testDA() 2.665558 3.961070 4.062354 4.139571 8.427439   100
#  testRB() 6.421328 6.669137 6.877517 6.966977 8.271469   100
# There were 50 or more warnings (use warnings() to see the first 50)
# > warnings()[1]
# Warning message:
# In `[.data.table`(dt2, , `:=`(DATEm1 = dt1$DATE[idx1],  ... :
#   Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.

Memory and time profiling on large data set, 50 iterations each

Profiling code

Rprof("testAR.out", memory.profiling=TRUE)
for(i in 1:50){
    arAns <- testAR()
}
Rprof(NULL)

Rprof("testDA.out", memory.profiling=TRUE)
for(i in 1:50){
    daAns <- testDA()
}
Rprof(NULL)

Rprof("testRB.out", memory.profiling=TRUE)
for(i in 1:50){
    rbAns <- testRB()
}
Rprof(NULL)

testAR() profile results

# > summaryRprof("testAR.out", memory="both")$by.self
#                   self.time self.pct total.time total.pct mem.total
# "[["                   0.02       10       0.06        30       8.3
# "head"                 0.02       10       0.04        20      12.1
# "nrow"                 0.02       10       0.04        20      10.6
# ".Call"                0.02       10       0.02        10       8.2
# ".row_names_info"      0.02       10       0.02        10       8.4
# "<Anonymous>"          0.02       10       0.02        10       8.3
# "key"                  0.02       10       0.02        10       0.0
# "levels.default"       0.02       10       0.02        10       0.0
# "match"                0.02       10       0.02        10       0.0
# "stopifnot"            0.02       10       0.02        10       4.2

testDA() profile results

# > summaryRprof("testDA.out", memory="both")$by.self
#                   self.time self.pct total.time total.pct mem.total
# "match"                2.04    26.56       2.34     30.47      94.2
# "[.data.frame"         1.78    23.18       6.50     84.64     295.3
# "NextMethod"           0.80    10.42       0.80     10.42      33.9
# "strptime"             0.42     5.47       0.46      5.99      25.9
# "["                    0.34     4.43       7.14     92.97     335.9
# "[.Date"               0.34     4.43       1.14     14.84      49.8
# "names"                0.34     4.43       0.34      4.43      17.9
# "%in%"                 0.28     3.65       1.44     18.75      50.3
# "dim"                  0.28     3.65       0.30      3.91      13.9
# "order"                0.16     2.08       0.18      2.34       1.7
# "$"                    0.16     2.08       0.16      2.08       7.0
# ".Call"                0.14     1.82       6.76     88.02     308.4
# "length"               0.14     1.82       0.14      1.82       4.6
# "sys.call"             0.14     1.82       0.14      1.82       5.6
# "<Anonymous>"          0.04     0.52       0.04      0.52       9.5
# "as.Date.POSIXlt"      0.04     0.52       0.04      0.52       3.4
# "getwd"                0.04     0.52       0.04      0.52       9.5
# "do.call"              0.02     0.26       0.18      2.34       1.7
# "assign"               0.02     0.26       0.04      0.52       0.1
# ".subset2"             0.02     0.26       0.02      0.26       6.1
# "all"                  0.02     0.26       0.02      0.26       0.0
# "file.info"            0.02     0.26       0.02      0.26       0.0
# "is.data.table"        0.02     0.26       0.02      0.26       0.0
# "lockBinding"          0.02     0.26       0.02      0.26       0.1
# "parent.frame"         0.02     0.26       0.02      0.26       0.0
# "pmatch"               0.02     0.26       0.02      0.26       0.0
# "which"                0.02     0.26       0.02      0.26       6.5

testRB() profile results

# > summaryRprof("testRB.out", memory="both")$by.self
#                 self.time self.pct total.time total.pct mem.total
# "sort.list"          0.04     9.52       0.06     14.29      21.5
# "length"             0.04     9.52       0.04      9.52       0.0
# "pmatch"             0.04     9.52       0.04      9.52      13.9
# "[.data.table"       0.02     4.76       0.42    100.00      71.8
# ".Call"              0.02     4.76       0.12     28.57      39.6
# "split.default"      0.02     4.76       0.10     23.81      32.9
# "alloc.col"          0.02     4.76       0.08     19.05      13.3
# "[["                 0.02     4.76       0.04      9.52       6.9
# "cedta"              0.02     4.76       0.04      9.52       0.0
# "lapply"             0.02     4.76       0.04      9.52       0.0
# "[[.data.frame"      0.02     4.76       0.02      4.76       6.9
# "as.character"       0.02     4.76       0.02      4.76       6.0
# "as.name"            0.02     4.76       0.02      4.76       5.3
# "attr"               0.02     4.76       0.02      4.76       0.0
# "exists"             0.02     4.76       0.02      4.76       0.0
# "FUN"                0.02     4.76       0.02      4.76       0.0
# "intersect"          0.02     4.76       0.02      4.76       6.5
# "is.data.table"      0.02     4.76       0.02      4.76       0.0

Conclusion

As far as I can tell, Arun's answer is the fastest and most memory efficient. rbatt's answer scales better with data set size than DA's – my initial guess was that approaches using POSIX or Date classes might not scale well, but I'm not sure that hunch is supported by the profiling results. If someone thinks it would be helpful, I can provide the full profile results instead of just the $by.self portion.

Also worth noting: time spent and memory used went hand in hand across approaches – the fastest approaches also used the least memory.

answered by rbatt


Here's another approach using data.table:

First, we convert df and event to data.tables. Here I'll use as.data.table(.), but you can use setDT if you don't want to make a copy; that is, setDT(df) converts df into a data.table by reference.

require(data.table) ## >= 1.9.2
dt1 = as.data.table(df)
dt2 = as.data.table(event)
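
Equivalently, to convert in place and avoid the copies (keeping in mind that this modifies df and event themselves, and the steps below would then operate on them directly):

setDT(df); setDT(event)  # df and event are now data.tables; no copies made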

Then we'll prepare the data as follows:

key_cols = c("COMP", "DATE")
setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
setkeyv(dt1, key_cols)

setcolorder rearranges the columns of your data.table by reference. setkeyv sorts the data.table by the given columns in ascending order and marks them as key columns of dt1.

The column reordering is essential here because we don't set a key on dt2 (that would sort dt2, which may be undesirable for you). Since no key is set on dt2, data.table takes its first n (= 2 here) columns to match against the key columns of dt1.

Note: a join x[i] in data.table absolutely requires the key of x to be set. Here x = dt1. Setting a key on i is optional, depending on whether you want its order preserved.

Now, we perform two joins and get the corresponding matching indices:

idx1 = dt1[dt2, which=TRUE, mult="first"]-1L
idx2 = dt1[dt2, which=TRUE, mult="last"]+1L

For each row of dt2, the first join gets the first matching position in dt1, and the second join gets the last matching position. We subtract and add 1L to get the previous and next indices, respectively.
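
For the example data, each (COMP, DATE) pair occurs only once in dt1, so the "first" and "last" matches coincide. Checking the indices by hand (my own verification against the keyed ordering above):

idx1  # 1 2 3 5 10 11 12 : positions in dt1 of the previous dates
idx2  # 3 4 5 7 12 13 14 : positions in dt1 of the next dates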

Take care of one special case:

idx1[idx1 == 0L] = NA

When the matching index is 1 (i.e. the event date is the first available date for that company), subtracting 1 results in 0. Because indexing with 0 in R returns a zero-length vector rather than NA, we have to replace it with NA explicitly.
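
A quick illustration of why the explicit replacement is needed:

x <- c("a", "b", "c")
x[0L]           # character(0): a zero-length result that would break column alignment
x[NA_integer_]  # NA: a missing value of length 1, which is what we want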

Now we can just subset those dates and add them to dt2 by reference as follows:

dt2[, `:=`(DATEm1 = dt1$DATE[idx1], 
           DATEp1 = dt1$DATE[idx2]
  )]

#    COMP       DATE  ARTICLE     DATEm1     DATEp1
# 1:    A 02.01.2000 blabla11 01.01.2000 03.01.2000
# 2:    A 03.01.2000 blabla12 02.01.2000 06.01.2000
# 3:    A 06.01.2000 blabla13 03.01.2000 07.01.2000
# 4:    A 09.01.2000 blabla14 07.01.2000 10.01.2000
# 5:    B 06.01.2000 blabla21 04.01.2000 07.01.2000
# 6:    B 07.01.2000 blabla22 06.01.2000 09.01.2000
# 7:    B 09.01.2000 blabla23 07.01.2000 10.01.2000
answered by Arun