Consider the following test data set using R:
testdat <- data.frame(
  "id"     = c(rep(1, 5), rep(2, 5), rep(3, 5)),
  "period" = rep(1:5, 3),
  "treat"  = c(c(0, 1, 1, 1, 0), c(0, 0, 1, 1, 1), c(0, 0, 1, 1, 1)),
  "state"  = c(rep(0, 5), c(0, 1, 1, 1, 1), c(0, 0, 0, 1, 1)),
  "int"    = c(rep(0, 13), 1, 1)
)
testdat
id period treat state int
1 1 1 0 0 0
2 1 2 1 0 0
3 1 3 1 0 0
4 1 4 1 0 0
5 1 5 0 0 0
6 2 1 0 0 0
7 2 2 0 1 0
8 2 3 1 1 0
9 2 4 1 1 0
10 2 5 1 1 0
11 3 1 0 0 0
12 3 2 0 0 0
13 3 3 1 0 0
14 3 4 1 1 1
15 3 5 1 1 1
The first four variables are what I have; int is the variable I want to create. It is similar to an interaction between treat and state, but a plain interaction would also put 1s in rows 8-10, which is not desired. Essentially, I want the interaction only when state changes during a run of treat, and not otherwise. Any thoughts on how to create this, especially on a large scale (a dataset with a million observations)?
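To make the difference concrete, here is a small sketch comparing the plain product with the desired column on the test data. The name plain is only for illustration; the three vectors are copied from testdat above.

```r
# Columns copied from the test data in the question.
treat <- c(0, 1, 1, 1, 0,  0, 0, 1, 1, 1,  0, 0, 1, 1, 1)
state <- c(0, 0, 0, 0, 0,  0, 1, 1, 1, 1,  0, 0, 0, 1, 1)
int   <- c(rep(0, 13), 1, 1)

# A plain interaction flags every row where both treat and state are 1.
plain <- treat * state

which(plain == 1)  # 8 9 10 14 15
which(int == 1)    # 14 15
```

Rows 8-10 are the ones to exclude: state is already 1 when that run of treat begins, so state never changes during it.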
Edit: To clarify why I want this measure, I want to run something like the following regression:
lm(outcome~treat+state+I(treat*state))
But I'm really interested in the interaction only when treat straddles a change in state. If I were to run the above regression, I(treat*state) would pool two cases: the one I'm interested in, and the one where state is already 1 for the entire time treat is 1. In theory, I think these two cases will have different effects, so I need to disaggregate them. I hope this makes sense, and I am happy to provide additional details.
I'm sure this is possible in base R, but here's a tidyversion:
library(dplyr)
testdat %>%
  group_by(grp = cumsum(c(FALSE, diff(treat) > 0))) %>%
  mutate(int2 = +(state > 0 & first(state) == 0 & treat > 0)) %>%
  ungroup() %>%
  select(-grp)
# # A tibble: 15 x 6
# id period treat state int int2
# <dbl> <int> <dbl> <dbl> <dbl> <int>
# 1 1 1 0 0 0 0
# 2 1 2 1 0 0 0
# 3 1 3 1 0 0 0
# 4 1 4 1 0 0 0
# 5 1 5 0 0 0 0
# 6 2 1 0 0 0 0
# 7 2 2 0 1 0 0
# 8 2 3 1 1 0 0
# 9 2 4 1 1 0 0
# 10 2 5 1 1 0 0
# 11 3 1 0 0 0 0
# 12 3 2 0 0 0 0
# 13 3 3 1 0 0 0
# 14 3 4 1 1 1 1
# 15 3 5 1 1 1 1
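For anyone who wants the base R route, here is a minimal sketch of the same logic without dplyr. The names grp, first_state and int2 are just illustrative, and testdat is rebuilt only so the snippet runs on its own; ave() is used to spread the first state of each group across that group's rows.

```r
testdat <- data.frame(
  id     = c(rep(1, 5), rep(2, 5), rep(3, 5)),
  period = rep(1:5, 3),
  treat  = c(0, 1, 1, 1, 0,  0, 0, 1, 1, 1,  0, 0, 1, 1, 1),
  state  = c(0, 0, 0, 0, 0,  0, 1, 1, 1, 1,  0, 0, 0, 1, 1),
  int    = c(rep(0, 13), 1, 1)
)

# Start a new group every time treat switches from 0 to 1.
grp <- cumsum(c(FALSE, diff(testdat$treat) > 0))

# state on the first row of each group, repeated for every row of that group.
first_state <- ave(testdat$state, grp, FUN = function(s) s[1])

# 1 only where treat is on, state is on, and the spell began with state off.
testdat$int2 <- as.integer(
  testdat$treat > 0 & testdat$state > 0 & first_state == 0
)

testdat$int2
```

Note that this grouping ignores id. If a run of treat could continue from the last row of one id into the first row of the next, passing id as an extra grouping vector, as in ave(testdat$state, testdat$id, grp, FUN = function(s) s[1]), should keep spells from straddling ids. That is an assumption about the intended design, not something the question states.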
An alternative way to build the groups uses run-length encoding, which is effectively the same thing (as suggested in https://stackoverflow.com/a/35313426):
testdat %>%
  group_by(grp = { yy <- rle(treat); rep(seq_along(yy$lengths), yy$lengths) }) %>%
  # ...
And as in that answer, I wish dplyr had an equivalent to data.table's rleid. The idea is to group by runs of consecutive identical values in a column, rather than by every row that shares the same value. If you look at this mid-pipe (before cleaning up grp), you'd see:
testdat %>%
  group_by(grp = { yy <- rle(treat); rep(seq_along(yy$lengths), yy$lengths) }) %>%
  mutate(int2 = +(state > 0 & first(state) == 0 & treat > 0)) %>%
  ungroup()
# # A tibble: 15 x 7
# id period treat state int grp int2
# <dbl> <int> <dbl> <dbl> <dbl> <int> <int>
# 1 1 1 0 0 0 1 0
# 2 1 2 1 0 0 2 0
# 3 1 3 1 0 0 2 0
# 4 1 4 1 0 0 2 0
# 5 1 5 0 0 0 3 0
# 6 2 1 0 0 0 3 0
# 7 2 2 0 1 0 3 0
# 8 2 3 1 1 0 4 0
# 9 2 4 1 1 0 4 0
# 10 2 5 1 1 0 4 0
# 11 3 1 0 0 0 5 0
# 12 3 2 0 0 0 5 0
# 13 3 3 1 0 0 6 0
# 14 3 4 1 1 1 6 1
# 15 3 5 1 1 1 6 1
But that's just wishful thinking. I guess I could also define a small helper:
my_rleid <- function(x) {
  yy <- rle(x)
  rep(seq_along(yy$lengths), yy$lengths)
}
testdat %>%
  group_by(grp = my_rleid(treat)) %>%
  # ...
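Since the question mentions a million rows, here is a rough sketch of the same idea using data.table itself, where rleid() is available directly. It assumes the data.table package is installed. Grouping by id as well as by the run is my own addition so that a run of treat cannot span two ids, and dt and run are just illustrative names.

```r
library(data.table)

dt <- data.table(
  id     = c(rep(1, 5), rep(2, 5), rep(3, 5)),
  period = rep(1:5, 3),
  treat  = c(0, 1, 1, 1, 0,  0, 0, 1, 1, 1,  0, 0, 1, 1, 1),
  state  = c(0, 0, 0, 0, 0,  0, 1, 1, 1, 1,  0, 0, 0, 1, 1),
  int    = c(rep(0, 13), 1, 1)
)

# One group per run of identical treat values within each id.
# state[1] is the state on the first row of that run.
dt[, int2 := as.integer(treat > 0 & state > 0 & state[1] == 0),
   by = .(id, run = rleid(treat))]

dt[int2 == 1]
```

The := assignment adds int2 by reference, so the table is not copied, which tends to matter at that size. On the dplyr side, versions 1.1.0 and later include consecutive_id(), which should be usable in place of my_rleid() in the pipe above.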