 

R: How to make transformations on subset of a subset fast (optimize or alternative to `ddply`)?

I have a somewhat large data set (784,932 rows, 27,492 unique IDs). For each Item within each ID, I am trying to create a dummy variable equal to 1 if the difference between consecutive dates is less than 60 seconds.

Stylised data and code:

ID   <- c(1,1,1,1,1,1,3,3,3,3,3,3)
Item <- c(10,10,10,20,20,20,10,20,10,10,10,20)
Date <- c("19/11/13 18:58:00","19/11/13 18:58:21","19/11/13 20:58:00",
          "19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 18:58:00",
          "19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 18:58:00",
          "19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 19:58:00")
df <- data.frame(ID, Item, Date)
df <- df[order(ID, Date), ]
df$Date <- strptime(df$Date, "%d/%m/%y %H:%M:%S")

# less than 60 sec difference = 1 (first item set to 999 -> 0)
fnDummy <- function(date) { ifelse(c(999, diff(date))<60, 1, 0) }

library(plyr)
ddply(df, .(ID, Item), transform, Dummy=fnDummy(Date) )

Output:

   ID Item                Date Dummy
1   1   10 2013-11-19 18:58:00     0
2   1   10 2013-11-19 18:58:21     1
3   1   10 2013-11-19 20:58:00     0
4   1   20 2013-11-19 18:58:00     0
5   1   20 2013-11-19 18:58:00     1
6   1   20 2013-11-19 18:58:00     1
7   3   10 2013-11-19 18:58:00     0
8   3   10 2013-11-19 18:58:00     1
9   3   10 2013-11-19 18:58:00     1
10  3   10 2013-11-19 18:58:00     1
11  3   20 2013-11-19 18:58:00     0
12  3   20 2013-11-19 19:58:00     1

From the output you can see that the first and second rows share an ID and Item, and the difference in Date is only 21 seconds, so the dummy is 1. The second and third rows also share an ID and Item, but here the difference in Date is much larger than 60 seconds, so the dummy is 0.

I managed to get the output I wanted, but the operation is slow. For 1,000 rows it takes about 40 seconds (see the system.time results below). This corresponds to roughly 180 minutes for the entire data set (my computer runs out of memory and crashes long before that).

   user  system elapsed 
 36.485   3.328  39.800 

How can I make this operation faster? Can I accomplish the same output using data.table, and would it be faster?

bonna asked Jan 12 '23


1 Answer

You can use data.table as you suggest. You'll have to convert your POSIXlt to POSIXct though:

library(data.table)
df$Date <- as.POSIXct(df$Date)
DT <- as.data.table(df)

DT[, dummy_date := fnDummy(Date), by=c('ID', 'Item')]

However, a big part of the slowdown is probably in the ifelse function, and you don't really need it, since you're already creating a boolean:

as.integer(c(FALSE, diff(date) < 60))
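Putting both suggestions together, a full data.table version might look like the sketch below (the column name `Dummy` is kept from the question). Note that `diff()` on a date-time vector returns a difftime whose units are chosen automatically, so converting to numeric seconds first makes the `< 60` comparison unambiguous:

```r
library(data.table)

# Stylised data from the question
ID   <- c(1,1,1,1,1,1,3,3,3,3,3,3)
Item <- c(10,10,10,20,20,20,10,20,10,10,10,20)
Date <- c("19/11/13 18:58:00","19/11/13 18:58:21","19/11/13 20:58:00",
          "19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 18:58:00",
          "19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 18:58:00",
          "19/11/13 18:58:00","19/11/13 18:58:00","19/11/13 19:58:00")

DT <- data.table(ID, Item, Date = as.POSIXct(Date, format = "%d/%m/%y %H:%M:%S"))
setkey(DT, ID, Item, Date)   # sort by Date within each (ID, Item) group

# Gap to the previous row in the same group, in seconds; the first row
# of each group has no predecessor, hence FALSE -> 0.
DT[, Dummy := as.integer(c(FALSE, diff(as.numeric(Date)) < 60)), by = .(ID, Item)]
```

Since `:=` updates the table by reference, no copy of the 784,932-row data set is made, which also addresses the memory problem mentioned in the question.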
Justin answered Jan 22 '23