Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Add dummies with conditions in data.table?

Tags:

r

data.table

Sorry for the long question. I will try my best to clarify my goal clealy

I want to add dummies in data.table using the update method, just like this already answered in this link, but a little bit more complicated.

For better description, I created the data.

DT <- data.table(UID = paste0("UID",rep(1:5,each=2)), 
                 date = as.IDate(c("2012-01-01","2012-01-02","2012-01-03","2012-01-04","2012-01-05","2012-01-06","2012-02-01","2012-02-02","2012-02-03","2012-02-04")),
                 value = c(1:10)) 

The DT is a data.table containing the information of UID, date, and value. In the original data, the structure is just the same, but with long time span(2 years).

Here I want to add dummies based on the date.

there're several special time spans in date, we can just use vacations to represent them.

For example, in the fake data I created above.

There're two vacations

  1. From "2012-01-02" to "2012-01-05"
  2. From "2012-02-02" to "2012-02-03"

I want to add 2 types of dummies

  1. Dummies about the length of vacation: First calculate the length from different vacations. In this example, we have two different lengths (2, and 4 ). So we'll add 2 dummies indicating whether the date are in these vacations.

The expected result is like this:

UID     Date    Val D_length_2  D_length_4
UID1    1/1/2012    1   FALSE   FALSE
UID2    1/2/2012    2   FALSE   TRUE
UID3    1/3/2012    3   FALSE   TRUE
UID4    1/4/2012    4   FALSE   TRUE
UID5    1/5/2012    5   FALSE   TRUE
UID1    1/6/2012    6   FALSE   FALSE
UID2    2/1/2012    7   TRUE    FALSE
UID3    2/2/2012    8   TRUE    FALSE
UID4    2/3/2012    9   FALSE   FALSE
UID5    2/4/2012    10  FALSE   FALSE
  1. Dummies about whether the day is exactly one day before the vacation, or exactly one day after the vacation.
UID    Date      Val    Before  After
UID1    1/1/2012    1   TRUE    FALSE
UID2    1/2/2012    2   FALSE   FALSE
UID3    1/3/2012    3   FALSE   FALSE
UID4    1/4/2012    4   FALSE   FALSE
UID5    1/5/2012    5   FALSE   FALSE
UID1    1/6/2012    6   FALSE   TRUE
UID2    2/1/2012    7   TRUE    FALSE
UID3    2/2/2012    8   FALSE   FALSE
UID4    2/3/2012    9   FALSE   FALSE
UID5    2/4/2012    10  FALSE   TRUE

So the total of desired results is like this

UID Date    Val Before  After   D_length_2  D_length_4
UID1    1/1/2012    1   TRUE    FALSE   FALSE   FALSE
UID2    1/2/2012    2   FALSE   FALSE   FALSE   TRUE
UID3    1/3/2012    3   FALSE   FALSE   FALSE   TRUE
UID4    1/4/2012    4   FALSE   FALSE   FALSE   TRUE
UID5    1/5/2012    5   FALSE   FALSE   FALSE   TRUE
UID1    1/6/2012    6   FALSE   TRUE    FALSE   FALSE
UID2    2/1/2012    7   TRUE    FALSE   FALSE   FALSE
UID3    2/2/2012    8   FALSE   FALSE   TRUE    FALSE
UID4    2/3/2012    9   FALSE   FALSE   TRUE    FALSE
UID5    2/4/2012    10  FALSE   TRUE    FALSE   FALSE

The total observations are more than 10M rows, with about 10 different vacations and 4 different length.

For the second type of dummies, I think

f <- function(x){ 
ifelse(x %in% as.Date(c("2012-01-02","2012-02-02")) - 1, return(TRUE), return(FALSE))
}

DT[,Before:= f(date)] 

But it seems not correct.

For the first one, I didn't come up with a good solution.

this problem is about the update in data.table, any thoughts about how to deal with it and how to write the update functions are extremely welcome!

like image 342
Bigchao Avatar asked Jan 15 '14 14:01

Bigchao


1 Answers

Here is a start:

library(data.table)

DT <- data.table(UID = paste0("UID",rep(1:5,each=2)), 
                 date = as.IDate(c("2012-01-01","2012-01-02","2012-01-03","2012-01-04","2012-01-05","2012-01-06","2012-02-01","2012-02-02","2012-02-03","2012-02-04")),
                 value = c(1:10)) 
setkey(DT, date)


vacStart <- data.table(start = as.IDate(c("2012-01-02", "2012-02-02")), key="start")
vacEnd <- data.table(date = as.IDate(c("2012-01-05", "2012-02-03")), key="date")

#identify vacations:
vacStart[, Start:=.I]
DT <- vacStart[DT, roll=TRUE]
vacEnd[, End:=.I]
DT <- vacEnd[DT, roll=-Inf]
DT[,vac:=(End==Start)*Start]
DT[is.na(vac), vac:=0L]

#2-day vacations:
DT[,length_2 := (.N==2) & vac!=0, by=vac]
#days before vacation
DT[,before := c(diff(vac)>0, FALSE) & vac==0]
#           date End Start  UID value vac length_2 before
#  1: 2012-01-01   1    NA UID1     1   0    FALSE   TRUE
#  2: 2012-01-02   1     1 UID1     2   1    FALSE  FALSE
#  3: 2012-01-03   1     1 UID2     3   1    FALSE  FALSE
#  4: 2012-01-04   1     1 UID2     4   1    FALSE  FALSE
#  5: 2012-01-05   1     1 UID3     5   1    FALSE  FALSE
#  6: 2012-01-06   2     1 UID3     6   0    FALSE  FALSE
#  7: 2012-02-01   2     1 UID4     7   0    FALSE   TRUE
#  8: 2012-02-02   2     2 UID4     8   2     TRUE  FALSE
#  9: 2012-02-03   2     2 UID5     9   2     TRUE  FALSE
# 10: 2012-02-04  NA     2 UID5    10   0    FALSE  FALSE
like image 51
Roland Avatar answered Sep 25 '22 00:09

Roland