I've got two data.frames, one with event data and one with stock data for several companies (here it's only two). I want two additional columns with lagged dates (-1 day and +1 day) for both companies in my event data.frame. The lagged dates should come from my stock data.frame (df), of course. How can I do that?
DATE <- c("01.01.2000","02.01.2000","03.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000","01.01.2000","02.01.2000","04.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df <- data.frame(DATE, RET, COMP)
df
# DATE RET COMP
# 1 01.01.2000 -2.00 A
# 2 02.01.2000 1.10 A
# 3 03.01.2000 3.00 A
# 4 06.01.2000 1.40 A
# 5 07.01.2000 -0.20 A
# 6 09.01.2000 0.60 A
# 7 10.01.2000 0.10 A
# 8 01.01.2000 -0.21 B
# 9 02.01.2000 -1.20 B
# 10 04.01.2000 0.90 B
# 11 06.01.2000 0.30 B
# 12 07.01.2000 -0.10 B
# 13 09.01.2000 0.30 B
# 14 10.01.2000 -0.12 B
DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")
event <- data.frame(DATE, ARTICLE, COMP)
event
# DATE ARTICLE COMP
# 1 02.01.2000 blabla11 A
# 2 03.01.2000 blabla12 A
# 3 06.01.2000 blabla13 A
# 4 09.01.2000 blabla14 A
# 5 06.01.2000 blabla21 B
# 6 07.01.2000 blabla22 B
# 7 09.01.2000 blabla23 B
The output should be my data.frame event with the two additional columns DATEm1 and DATEp1:
# DATE DATEm1 DATEp1 ARTICLE COMP
# 1 02.01.2000 01.01.2000 03.01.2000 blabla11 A
# 2 03.01.2000 02.01.2000 06.01.2000 blabla12 A
# 3 06.01.2000 03.01.2000 07.01.2000 blabla13 A
# 4 09.01.2000 07.01.2000 10.01.2000 blabla14 A
# 5 06.01.2000 04.01.2000 07.01.2000 blabla21 B
# 6 07.01.2000 06.01.2000 09.01.2000 blabla22 B
# 7 09.01.2000 07.01.2000 10.01.2000 blabla23 B
I have tried the approach in G. Grothendieck's answer, which works perfectly for this example.
The problem is that my original data.frame contains far more data than this example, and the sqldf approach is rather slow and uses a lot of memory (too much for my machine). Does anyone have another solution?
I tried an approach that uses embed and data.table. Testing with the provided example data, it is competitive with the other data.table approaches (see benchmarking below), but still a bit slower. The embed approach might be faster when extended to additional lags, but I'm not sure if that's relevant (a short illustration of embed follows below).
Anyway, I put the answers (as of right now) together and compared the timings and output. I don't know how much the exact output matters to you (e.g., I lost a bit of time on the benchmarking because I had to dump the RET column), but note that the answers vary slightly in output format/content. All approaches provide a result similar to your desired output format.
I wonder if the different methods scale differently for different sizes of data.frames... If you test these, I'd be curious to know which is fastest for you and your data! :)
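To make the embed trick concrete, here is a minimal sketch of what embed with dimension 3 returns, on a toy character vector rather than the example data:
embed(c("d1", "d2", "d3", "d4"), 3)
#      [,1] [,2] [,3]
# [1,] "d3" "d2" "d1"
# [2,] "d4" "d3" "d2"
Each row pairs a value with its two predecessors; the helper e2() below pads the top with d-1 rows of NA so the output lines up with the original vector, and m2l() splits the columns into a list suitable for data.table's :=.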
library("data.table")
library("sqldf")
library("microbenchmark")
# ========
# = Data =
# ========
DATE <- c("01.01.2000", "02.01.2000", "03.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000", "01.01.2000", "02.01.2000", "04.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df0 <- data.frame(DATE, RET, COMP)
DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")
event0 <- data.frame(DATE, ARTICLE, COMP)
# ==================
# = rbatt function =
# ==================
# Deviations from desired format:
# 1) column order (COMP is first instead of last, otherwise correct order)
m2l <- function(x) split(x, rep(1:ncol(x), each = nrow(x))) # matrix -> list of its columns; thanks to https://stackoverflow.com/a/6823557/2343633
e2 <- function(x, d=1) m2l(rbind(matrix(NA, ncol=d, nrow=d-1), embed(x,d))) # embed() padded with d-1 NA rows so each column has length(x) elements
testRB <- function(df=df0, event=event0){
  dt1 <- as.data.table(df)
  dt1[, DATE := as.character(DATE)]
  # each row now carries a date (DATE), the next date (DATEp1), and the previous date (DATEm1), per company
  dt1[, c("DATEp1", "DATE", "DATEm1") := e2(DATE, 3), by = COMP]
  dt1[, RET := NULL]
  setkey(dt1, COMP, DATE, DATEp1, DATEm1)
  dt2 <- as.data.table(event)
  dt2[, DATE := as.character(DATE)]
  setkey(dt2, COMP, DATE)
  # below is slightly slower than doing dt1[,RET:=NULL] then dt <- dt1[dt2]
  # dt <- dt1[dt2, list(DATEp1, DATEm1, ARTICLE)] # join
  dt <- dt1[dt2]
  dt
}
rbatt output:
# COMP DATE DATEp1 DATEm1 ARTICLE
#1: A 02.01.2000 03.01.2000 01.01.2000 blabla11
#2: A 03.01.2000 06.01.2000 02.01.2000 blabla12
#3: A 06.01.2000 07.01.2000 03.01.2000 blabla13
#4: A 09.01.2000 10.01.2000 07.01.2000 blabla14
#5: B 06.01.2000 07.01.2000 04.01.2000 blabla21
#6: B 07.01.2000 09.01.2000 06.01.2000 blabla22
#7: B 09.01.2000 10.01.2000 07.01.2000 blabla23
edited – DA optimization #1 (old code commented-out)
edited – DA optimization #2 (old code commented, versions labeled)
# ===========================
# = David Arenburg function =
# ===========================
# https://stackoverflow.com/a/23483865/2343633
# Deviations from desired format:
# 1) column order
# 2) format of DATE, DATEm1, DATEp1
testDA <- function(df=df0, event=event0){
  # Original DA version:
  # df$DATE <- as.Date(strptime(as.character(df$DATE), format = "%m.%d.%Y"))
  # event$DATE <- as.Date(strptime(as.character(event$DATE), format = "%m.%d.%Y"))
  #
  # ## Making sure "df" is sorted. If your data sets are already ordered you can
  # ## skip the ordering both here and in the `setDT`
  # df <- df[order(df$COMP, df$DATE), ]
  #
  # library(data.table)
  # DT <- setDT(event)[order(COMP, DATE), list(
  #   DATEm1 = df[match(DATE, df$DATE) - 1, "DATE"],
  #   DATEp1 = df[match(DATE, df$DATE) + 1, "DATE"]
  # ), by = c("ARTICLE", "DATE", "COMP")]
  # DT

  # Optimization #1:
  # event$DATE <- as.character(event$DATE) # converting event$DATE to character (skip if it already is)
  # tempdf <- as.character(data.table(df, key = c("COMP", "DATE"))$DATE) # sorting and converting df$DATE to character too so they will match
  # setDT(event)[order(COMP, DATE), `:=` (
  #   DATEm1 = tempdf[match(DATE, tempdf) - 1],
  #   DATEp1 = tempdf[match(DATE, tempdf) + 1]
  # ), by = c("DATE", "COMP")]
  # event

  # Optimization #2:
  # library(data.table) # loading the data.table package
  tempdf <- data.table(df, key = c("COMP", "DATE"))$DATE # sorting df and keeping only the dates for speed
  setDT(event)[order(COMP, DATE), `:=` (
    DATEm1 = tempdf[match(DATE, tempdf) - 1],
    DATEp1 = tempdf[match(DATE, tempdf) + 1]
  )]
  event
}
David Arenburg output:
edited for DA optimization #1 (#2 may be bugged)
note the wrong content in row 5, column "DATEm1": the month should be 04
# > testDA()
# DATE ARTICLE COMP DATEm1 DATEp1
# 1: 02.01.2000 blabla11 A 01.01.2000 03.01.2000
# 2: 03.01.2000 blabla12 A 02.01.2000 06.01.2000
# 3: 06.01.2000 blabla13 A 03.01.2000 07.01.2000
# 4: 09.01.2000 blabla14 A 07.01.2000 10.01.2000
# 5: 06.01.2000 blabla21 B 03.01.2000 07.01.2000
# 6: 07.01.2000 blabla22 B 06.01.2000 09.01.2000
# 7: 09.01.2000 blabla23 B 07.01.2000 10.01.2000
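A possible explanation for the wrong row-5 value (my reading of the code, not something DA confirmed): match() searches the whole pooled, company-sorted date vector, so a date that occurs for both companies always resolves into company A's block:
tempdf <- data.table(df0, key = c("COMP", "DATE"))$DATE # A's dates first, then B's
match("06.01.2000", tempdf)             # 4 -- first occurrence, inside company A's block
tempdf[match("06.01.2000", tempdf) - 1] # 03.01.2000 -- company A's previous date,
                                        # which ends up in company B's row 5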
# ============================
# = G. Grothendieck function =
# ============================
# https://stackoverflow.com/a/23415033/2343633
# Deviations from desired format:
# 1) format of DATE, DATEm1, DATEp1
testGG <- function(df=df0, event=event0){
  # ensure that dates sort correctly by converting to yyyy-mm-dd
  df2 <- transform(df, DATE = format(as.Date(DATE, "%m.%d.%Y")))
  event2 <- transform(event, DATE = format(as.Date(DATE, "%m.%d.%Y")))
  result <- sqldf(c("create index i on df2(COMP, DATE)",
    "select
       event.DATE,
       max(A.DATE) DATEm1,
       min(B.DATE) DATEp1,
       event.ARTICLE,
       event.COMP
     from event2 event, main.df2 A, main.df2 B
     on event.COMP = A.COMP and event.COMP = B.COMP
       and event.DATE > A.DATE and event.DATE < B.DATE
     group by event.DATE, event.COMP
     order by event.COMP, event.DATE"))
  result
}
GG output:
# DATE DATEm1 DATEp1 ARTICLE COMP
# 1 2000-02-01 2000-01-01 2000-03-01 blabla11 A
# 2 2000-03-01 2000-02-01 2000-06-01 blabla12 A
# 3 2000-06-01 2000-03-01 2000-07-01 blabla13 A
# 4 2000-09-01 2000-07-01 2000-10-01 blabla14 A
# 5 2000-06-01 2000-04-01 2000-07-01 blabla21 B
# 6 2000-07-01 2000-06-01 2000-09-01 blabla22 B
# 7 2000-09-01 2000-07-01 2000-10-01 blabla23 B
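If you need the original mm.dd.yyyy strings back, one possible post-processing step (a sketch, not part of GG's answer):
# convert the ISO dates in the result back to the original mm.dd.yyyy strings
result[c("DATE", "DATEm1", "DATEp1")] <- lapply(
  result[c("DATE", "DATEm1", "DATEp1")],
  function(x) format(as.Date(x), "%m.%d.%Y"))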
# =================
# = Arun function =
# =================
# https://stackoverflow.com/a/23484292/2343633
# Deviations from desired format:
# 1) Column order (COMP first, ARTICLE does not come after DATEm1)
testAR <- function(df=df0, event=event0){
  dt1 = as.data.table(df)
  dt2 = as.data.table(event)
  key_cols = c("COMP", "DATE")
  setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
  setkeyv(dt1, key_cols)
  idx1 = dt1[dt2, which=TRUE, mult="first"] - 1L # row before each event's first match in dt1
  idx2 = dt1[dt2, which=TRUE, mult="last"] + 1L  # row after each event's last match in dt1
  idx1[idx1 == 0L] = NA # a match in dt1's first row has no previous date
  dt2[, `:=`(DATEm1 = dt1$DATE[idx1],
             DATEp1 = dt1$DATE[idx2]
  )]
  dt2
}
Arun output:
# COMP DATE ARTICLE DATEm1 DATEp1
# 1: A 02.01.2000 blabla11 01.01.2000 03.01.2000
# 2: A 03.01.2000 blabla12 02.01.2000 06.01.2000
# 3: A 06.01.2000 blabla13 03.01.2000 07.01.2000
# 4: A 09.01.2000 blabla14 07.01.2000 10.01.2000
# 5: B 06.01.2000 blabla21 04.01.2000 07.01.2000
# 6: B 07.01.2000 blabla22 06.01.2000 09.01.2000
# 7: B 09.01.2000 blabla23 07.01.2000 10.01.2000
edit – note that this is the original benchmark (original code, original OP data set)
# =============
# = Benchmark =
# =============
microbenchmark(testAR(), testDA(), testRB(), testGG())
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.220278 3.414430 3.509251 3.626438 7.209494 100
# testDA() 4.273542 4.471227 4.569370 4.752857 6.460922 100
# testRB() 5.704559 5.981680 6.135946 6.457392 14.309858 100
# testGG() 22.337065 23.064494 23.964581 24.622467 50.934712 100
Note that I dropped testGG() from this benchmark because it was far slower (I did some tests on a couple of intermediate data sets, and testGG() scaled worse than the other three approaches).
# ========
# = Data =
# ========
mos <- c("01","02","03","06","07","09","10", "01", "02", "04", "06", "07", "09", "10")
yrs <- 1920:2020
DATE <- paste(mos, "01", rep(yrs, each=length(mos)), sep=".")
RET <- rep(c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12), length(yrs))
COMP <- rep(c("A","A","A","A","A","A","A","B","B","B","B","B","B","B"), length(yrs))
df0 <- data.frame(DATE, RET, COMP)
mos2 <- c("02","03","06","09","06","07","09")
DATE <- paste(mos2, "01", rep(yrs, each=length(mos2)), sep=".")
ARTICLE <- rep(c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23"), length(yrs))
COMP <- rep(c("A","A","A","A","B","B","B"), length(yrs))
event0 <- data.frame(DATE, ARTICLE, COMP)
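For scale, that construction yields 14 * 101 = 1414 stock rows and 7 * 101 = 707 event rows:
nrow(df0)    # 1414
nrow(event0) # 707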
edit – original benchmarks for large dataset:
# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.458217 3.696698 3.934349 4.697033 6.584214 100
# testDA() 143.180409 148.916461 151.776002 155.219515 237.524369 100
# testRB() 7.279168 7.636102 8.073778 8.828537 11.143111 100
edit – benchmark for large dataset after DA optimization #1:
# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.198266 3.440739 3.605723 3.788199 22.52867 100
# testDA() 56.290346 59.528819 60.821921 64.580825 80.99480 100
# testRB() 6.763570 7.200741 7.400343 7.748849 20.97527 100
edit – benchmark for large data set after DA optimization #2:
NOTE – warning resulting from update #2 to testDA()
# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.423508 6.055584 6.246517 6.333444 7.653360 100
# testDA() 2.665558 3.961070 4.062354 4.139571 8.427439 100
# testRB() 6.421328 6.669137 6.877517 6.966977 8.271469 100
# There were 50 or more warnings (use warnings() to see the first 50)
# > warnings()[1]
# Warning message:
# In `[.data.table`(dt2, , `:=`(DATEm1 = dt1$DATE[idx1], ... :
# Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.
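A plausible cause of the warning (my guess, not verified): setDT(event) inside testDA() converts the shared event0 to a data.table by reference, so later benchmark iterations start from an already-modified object. A defensive copy of the input should sidestep that; a hypothetical variant:
# hypothetical variant of testDA: copy the input so setDT() never touches the global event0
testDA2 <- function(df = df0, event = data.table::copy(event0)){
  tempdf <- data.table(df, key = c("COMP", "DATE"))$DATE
  setDT(event)[order(COMP, DATE), `:=` (
    DATEm1 = tempdf[match(DATE, tempdf) - 1],
    DATEp1 = tempdf[match(DATE, tempdf) + 1]
  )]
  event
}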
Profiling code
Rprof("testAR.out", memory.profiling=TRUE)
for(i in 1:50){
arAns <- testAR()
}
Rprof(NULL)
Rprof("testDA.out", memory.profiling=TRUE)
for(i in 1:50){
daAns <- testDA()
}
Rprof(NULL)
Rprof("testRB.out", memory.profiling=TRUE)
for(i in 1:50){
rbAns <- testRB()
}
Rprof(NULL)
testAR() profile results
# > summaryRprof("testAR.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "[[" 0.02 10 0.06 30 8.3
# "head" 0.02 10 0.04 20 12.1
# "nrow" 0.02 10 0.04 20 10.6
# ".Call" 0.02 10 0.02 10 8.2
# ".row_names_info" 0.02 10 0.02 10 8.4
# "<Anonymous>" 0.02 10 0.02 10 8.3
# "key" 0.02 10 0.02 10 0.0
# "levels.default" 0.02 10 0.02 10 0.0
# "match" 0.02 10 0.02 10 0.0
# "stopifnot" 0.02 10 0.02 10 4.2
testDA() profile results
# > summaryRprof("testDA.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "match" 2.04 26.56 2.34 30.47 94.2
# "[.data.frame" 1.78 23.18 6.50 84.64 295.3
# "NextMethod" 0.80 10.42 0.80 10.42 33.9
# "strptime" 0.42 5.47 0.46 5.99 25.9
# "[" 0.34 4.43 7.14 92.97 335.9
# "[.Date" 0.34 4.43 1.14 14.84 49.8
# "names" 0.34 4.43 0.34 4.43 17.9
# "%in%" 0.28 3.65 1.44 18.75 50.3
# "dim" 0.28 3.65 0.30 3.91 13.9
# "order" 0.16 2.08 0.18 2.34 1.7
# "$" 0.16 2.08 0.16 2.08 7.0
# ".Call" 0.14 1.82 6.76 88.02 308.4
# "length" 0.14 1.82 0.14 1.82 4.6
# "sys.call" 0.14 1.82 0.14 1.82 5.6
# "<Anonymous>" 0.04 0.52 0.04 0.52 9.5
# "as.Date.POSIXlt" 0.04 0.52 0.04 0.52 3.4
# "getwd" 0.04 0.52 0.04 0.52 9.5
# "do.call" 0.02 0.26 0.18 2.34 1.7
# "assign" 0.02 0.26 0.04 0.52 0.1
# ".subset2" 0.02 0.26 0.02 0.26 6.1
# "all" 0.02 0.26 0.02 0.26 0.0
# "file.info" 0.02 0.26 0.02 0.26 0.0
# "is.data.table" 0.02 0.26 0.02 0.26 0.0
# "lockBinding" 0.02 0.26 0.02 0.26 0.1
# "parent.frame" 0.02 0.26 0.02 0.26 0.0
# "pmatch" 0.02 0.26 0.02 0.26 0.0
# "which" 0.02 0.26 0.02 0.26 6.5
testRB() profile results
# > summaryRprof("testRB.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "sort.list" 0.04 9.52 0.06 14.29 21.5
# "length" 0.04 9.52 0.04 9.52 0.0
# "pmatch" 0.04 9.52 0.04 9.52 13.9
# "[.data.table" 0.02 4.76 0.42 100.00 71.8
# ".Call" 0.02 4.76 0.12 28.57 39.6
# "split.default" 0.02 4.76 0.10 23.81 32.9
# "alloc.col" 0.02 4.76 0.08 19.05 13.3
# "[[" 0.02 4.76 0.04 9.52 6.9
# "cedta" 0.02 4.76 0.04 9.52 0.0
# "lapply" 0.02 4.76 0.04 9.52 0.0
# "[[.data.frame" 0.02 4.76 0.02 4.76 6.9
# "as.character" 0.02 4.76 0.02 4.76 6.0
# "as.name" 0.02 4.76 0.02 4.76 5.3
# "attr" 0.02 4.76 0.02 4.76 0.0
# "exists" 0.02 4.76 0.02 4.76 0.0
# "FUN" 0.02 4.76 0.02 4.76 0.0
# "intersect" 0.02 4.76 0.02 4.76 6.5
# "is.data.table" 0.02 4.76 0.02 4.76 0.0
As far as I can tell, Arun's answer is the fastest and most memory-efficient. rbatt's answer scales better with data set size than DA's did before optimization #2 – my initial guess was that approaches using POSIX or Date classes might not scale well, but I'm not sure that hunch is supported by the profiling results. If someone thinks it would be helpful, I could provide the full profile results instead of just the $by.self portion.
Also worth noting: time spent and memory used were positively correlated among approaches – the fastest approaches also used the least memory.
Here's another approach using data.table:
First, we convert df and event to data.tables. Here I'll use as.data.table(.), but you can use setDT if you don't want to make a copy. That is, by doing setDT(df), df will be modified by reference into a data.table.
require(data.table) ## >= 1.9.2
dt1 = as.data.table(df)
dt2 = as.data.table(event)
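A quick sketch of the copy-versus-reference difference, on toy data:
DF <- data.frame(x = 1:3)
DT <- as.data.table(DF) # DT is a copy; DF is still a plain data.frame
setDT(DF)               # converts DF itself by reference, no copy made
class(DF)               # "data.table" "data.frame"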
Then we'll prepare the data as follows:
key_cols = c("COMP", "DATE")
setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
setkeyv(dt1, key_cols)
setcolorder rearranges the columns of your data.tables by reference. setkeyv sorts the data.table by the given columns in ascending order and marks them as the key columns of dt1.
The column reordering is essential here because we don't set a key on dt2 (that would sort dt2, which may be undesirable for you). And since no key is set on dt2, data.table takes the first 'n' (= 2 here) columns from dt2 to match against the key columns of dt1.
Note: a join x[i] using data.table absolutely requires the key of x to be set. Here x = dt1. Setting a key on i is optional, depending on whether you wish the order to be preserved or not.
Now, we perform two joins and get the corresponding matching indices:
idx1 = dt1[dt2, which=TRUE, mult="first"]-1L
idx2 = dt1[dt2, which=TRUE, mult="last"]+1L
The first join gets, for each match of dt2 in dt1, the first matching position in dt1. Similarly, the second join gets, for each match of dt2 in dt1, the last matching position in dt1. We add -1 and +1 to get the previous and next indices, respectively.
Take care of one special case:
idx1[idx1 == 0L] = NA
When the matching index is 1, subtracting 1 results in 0, and indexing a vector with 0 returns a zero-length result rather than NA, so we have to replace it with NA explicitly. (The other end needs no such fix: an index past the last row already returns NA.)
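A minimal illustration of why the 0-index, but not the out-of-range index, needs fixing (toy vector, not the example data):
x <- c("a", "b", "c")
x[0] # character(0) -- zero-length, would silently drop an element
x[4] # NA -- indexing past the end already gives NA, so idx2 is fine as is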
Now we can just subset those dates and add them to dt2 by reference, as follows:
dt2[, `:=`(DATEm1 = dt1$DATE[idx1],
DATEp1 = dt1$DATE[idx2]
)]
# COMP DATE ARTICLE DATEm1 DATEp1
# 1: A 02.01.2000 blabla11 01.01.2000 03.01.2000
# 2: A 03.01.2000 blabla12 02.01.2000 06.01.2000
# 3: A 06.01.2000 blabla13 03.01.2000 07.01.2000
# 4: A 09.01.2000 blabla14 07.01.2000 10.01.2000
# 5: B 06.01.2000 blabla21 04.01.2000 07.01.2000
# 6: B 07.01.2000 blabla22 06.01.2000 09.01.2000
# 7: B 09.01.2000 blabla23 07.01.2000 10.01.2000