Sorry for the long question. I will try my best to clarify my goal clealy
I want to add dummies in data.table using the update method, just like this already answered in this link, but a little bit more complicated.
For better description, I created the data.
DT <- data.table(UID = paste0("UID",rep(1:5,each=2)),
date = as.IDate(c("2012-01-01","2012-01-02","2012-01-03","2012-01-04","2012-01-05","2012-01-06","2012-02-01","2012-02-02","2012-02-03","2012-02-04")),
value = c(1:10))
The DT is a data.table containing the information of UID, date, and value. In the original data, the structure is just the same, but with long time span(2 years).
Here I want to add dummies based on the date.
there're several special time spans in date, we can just use vacations to represent them.
For example, in the fake data I created above.
There're two vacations
I want to add 2 types of dummies
The expected result is like this:
UID Date Val D_length_2 D_length_4 UID1 1/1/2012 1 FALSE FALSE UID2 1/2/2012 2 FALSE TRUE UID3 1/3/2012 3 FALSE TRUE UID4 1/4/2012 4 FALSE TRUE UID5 1/5/2012 5 FALSE TRUE UID1 1/6/2012 6 FALSE FALSE UID2 2/1/2012 7 TRUE FALSE UID3 2/2/2012 8 TRUE FALSE UID4 2/3/2012 9 FALSE FALSE UID5 2/4/2012 10 FALSE FALSE
UID Date Val Before After UID1 1/1/2012 1 TRUE FALSE UID2 1/2/2012 2 FALSE FALSE UID3 1/3/2012 3 FALSE FALSE UID4 1/4/2012 4 FALSE FALSE UID5 1/5/2012 5 FALSE FALSE UID1 1/6/2012 6 FALSE TRUE UID2 2/1/2012 7 TRUE FALSE UID3 2/2/2012 8 FALSE FALSE UID4 2/3/2012 9 FALSE FALSE UID5 2/4/2012 10 FALSE TRUE
So the total of desired results is like this
UID Date Val Before After D_length_2 D_length_4 UID1 1/1/2012 1 TRUE FALSE FALSE FALSE UID2 1/2/2012 2 FALSE FALSE FALSE TRUE UID3 1/3/2012 3 FALSE FALSE FALSE TRUE UID4 1/4/2012 4 FALSE FALSE FALSE TRUE UID5 1/5/2012 5 FALSE FALSE FALSE TRUE UID1 1/6/2012 6 FALSE TRUE FALSE FALSE UID2 2/1/2012 7 TRUE FALSE FALSE FALSE UID3 2/2/2012 8 FALSE FALSE TRUE FALSE UID4 2/3/2012 9 FALSE FALSE TRUE FALSE UID5 2/4/2012 10 FALSE TRUE FALSE FALSE
The total observations are more than 10M rows, with about 10 different vacations and 4 different length.
For the second type of dummies, I think
f <- function(x){
ifelse(x %in% as.Date(c("2012-01-02","2012-02-02")) - 1, return(TRUE), return(FALSE))
}
DT[,Before:= f(date)]
But it seems not correct.
For the first one, I didn't come up with a good solution.
this problem is about the update in data.table, any thoughts about how to deal with it and how to write the update functions are extremely welcome!
Here is a start:
library(data.table)
DT <- data.table(UID = paste0("UID",rep(1:5,each=2)),
date = as.IDate(c("2012-01-01","2012-01-02","2012-01-03","2012-01-04","2012-01-05","2012-01-06","2012-02-01","2012-02-02","2012-02-03","2012-02-04")),
value = c(1:10))
setkey(DT, date)
vacStart <- data.table(start = as.IDate(c("2012-01-02", "2012-02-02")), key="start")
vacEnd <- data.table(date = as.IDate(c("2012-01-05", "2012-02-03")), key="date")
#identify vacations:
vacStart[, Start:=.I]
DT <- vacStart[DT, roll=TRUE]
vacEnd[, End:=.I]
DT <- vacEnd[DT, roll=-Inf]
DT[,vac:=(End==Start)*Start]
DT[is.na(vac), vac:=0L]
#2-day vacations:
DT[,length_2 := (.N==2) & vac!=0, by=vac]
#days before vacation
DT[,before := c(diff(vac)>0, FALSE) & vac==0]
# date End Start UID value vac length_2 before
# 1: 2012-01-01 1 NA UID1 1 0 FALSE TRUE
# 2: 2012-01-02 1 1 UID1 2 1 FALSE FALSE
# 3: 2012-01-03 1 1 UID2 3 1 FALSE FALSE
# 4: 2012-01-04 1 1 UID2 4 1 FALSE FALSE
# 5: 2012-01-05 1 1 UID3 5 1 FALSE FALSE
# 6: 2012-01-06 2 1 UID3 6 0 FALSE FALSE
# 7: 2012-02-01 2 1 UID4 7 0 FALSE TRUE
# 8: 2012-02-02 2 2 UID4 8 2 TRUE FALSE
# 9: 2012-02-03 2 2 UID5 9 2 TRUE FALSE
# 10: 2012-02-04 NA 2 UID5 10 0 FALSE FALSE
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With