I have a somewhat large data set (784,932 rows/items, 27,492 unique IDs). For each Item in each ID, I am trying to create a dummy variable equal to 1, if the difference between the dates is less than 60 secs.
Stylised data and code:
ID <- c(1,1,1,1,1,1,3,3,3,3,3,3)
Item <- c(10,10,10,20,20,20,10,20,10,10,10,20)
Date <- c("19/11/13 18:58:00","19/11/13 18:58:21","19/11/13 20:58:00","19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 19:58:00")
df <- data.frame(ID, Item, Date)
df <- df[order(ID, Date), ]
df[, "Date"] = lapply(df["Date"],function(x){strptime(x, "%d/%m/%y %H:%M:%S")})
# less than 60 sec difference = 1 (first item set to 999 -> 0)
fnDummy <- function(date) { ifelse(c(999, diff(date))<60, 1, 0) }
library(plyr)
ddply(df, .(ID, Item), transform, Dummy=fnDummy(Date) )
Output:
ID Item Date Dummy
1 1 10 2013-11-19 18:58:00 0
2 1 10 2013-11-19 18:58:21 1
3 1 10 2013-11-19 20:58:00 0
4 1 20 2013-11-19 18:58:00 0
5 1 20 2013-11-19 18:58:00 1
6 1 20 2013-11-19 18:58:00 1
7 3 10 2013-11-19 18:58:00 0
8 3 10 2013-11-19 18:58:00 1
9 3 10 2013-11-19 18:58:00 1
10 3 10 2013-11-19 18:58:00 1
11 3 20 2013-11-19 18:58:00 0
12 3 20 2013-11-19 19:58:00 1
From the output you see that the first and second row have common ID and Item, and the difference in Date is only 21 secs, so the dummy is 1. The second and third row also have common ID and Item, but here the difference in date is much larger than 60 secs, so dummy is 0.
I managed to get the output I wanted, but the operation is slow. For 1000 rows it takes about 40 sec (see system.time
results below). This corresponding to approx. 180 minutes for the entire data set (my computer runs out of memory and crashes long before this).
user system elapsed
36.485 3.328 39.800
How can I make this operation faster? Can i accomplish the same output using data.table
, and is it faster?
You can use data.table
as you suggest. You'll have to convert your POSIXlt
to POSIXct
though:
library(data.table)
df$Date <- as.POSIXct(df$Date)
DT <- as.data.table(df)
DT[, dummy_date := fnDummy(Date), by=c('ID', 'Item')]
However, a big part of the slowdown is probably in the ifelse
function and you don't really need it since you're creating a boolean:
as.integer(c(FALSE, diff(date) < 60))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With