How do I create this variable in R?

Consider the following test data set using R:

testdat <- data.frame(id     = rep(c(1, 2, 3), each = 5),
                      period = rep(1:5, 3),
                      treat  = c(0,1,1,1,0, 0,0,1,1,1, 0,0,1,1,1),
                      state  = c(0,0,0,0,0, 0,1,1,1,1, 0,0,0,1,1),
                      int    = c(rep(0, 13), 1, 1))
testdat
   id period treat state int
1   1      1     0     0   0
2   1      2     1     0   0
3   1      3     1     0   0
4   1      4     1     0   0
5   1      5     0     0   0
6   2      1     0     0   0
7   2      2     0     1   0
8   2      3     1     1   0
9   2      4     1     1   0
10  2      5     1     1   0
11  3      1     0     0   0
12  3      2     0     0   0
13  3      3     1     0   0
14  3      4     1     1   1
15  3      5     1     1   1

The first 4 variables are what I have; int is the variable I want to create. It is similar to an interaction between treat and state, but a plain interaction would also put 1s in rows 8-10, which I do not want. Essentially, I only want the interaction to switch on when state changes from 0 to 1 while treat is 1, and not otherwise. Any thoughts on how to create this, especially at scale (a dataset with a million observations)?
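To make the mismatch concrete, here is a minimal sketch (not part of the original post) that compares the plain product of treat and state with the desired int. The column values are copied from the data above; the name plain is introduced here only for illustration.

```r
# Columns copied from testdat above
treat <- c(0,1,1,1,0, 0,0,1,1,1, 0,0,1,1,1)
state <- c(0,0,0,0,0, 0,1,1,1,1, 0,0,0,1,1)
int   <- c(rep(0, 13), 1, 1)

# Ordinary interaction term (hypothetical name, for illustration only)
plain <- treat * state

which(plain == 1)    # 8 9 10 14 15
which(plain != int)  # 8 9 10: the rows where treat starts after state is already 1
```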

Edit: To clarify why I want this measure, I want to run something like the following regression:

lm(outcome~treat+state+I(treat*state))

But I'm really interested in the interaction only when treat straddles a change in state. If I were to run the above regression, I(treat*state) would pool the interaction I'm interested in with the case where treat is 1 only while state is already 1. In theory, I think these two cases have different effects, so I need to disaggregate them. I hope this makes sense, and I am happy to provide additional details.

dkro asked May 09 '26 11:05


1 Answer

I'm sure this is possible in base R, but here's a tidyversion:

library(dplyr)
testdat %>%
  group_by(grp = cumsum(c(FALSE, diff(treat) > 0))) %>%
  mutate(int2 = +(state > 0 & first(state) == 0 & treat > 0)) %>%
  ungroup() %>%
  select(-grp)
# # A tibble: 15 x 6
#       id period treat state   int  int2
#    <dbl>  <int> <dbl> <dbl> <dbl> <int>
#  1     1      1     0     0     0     0
#  2     1      2     1     0     0     0
#  3     1      3     1     0     0     0
#  4     1      4     1     0     0     0
#  5     1      5     0     0     0     0
#  6     2      1     0     0     0     0
#  7     2      2     0     1     0     0
#  8     2      3     1     1     0     0
#  9     2      4     1     1     0     0
# 10     2      5     1     1     0     0
# 11     3      1     0     0     0     0
# 12     3      2     0     0     0     0
# 13     3      3     1     0     0     0
# 14     3      4     1     1     1     1
# 15     3      5     1     1     1     1
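For the base R route mentioned at the top, something like the following sketch might work. It is not from the original answer: it reuses the same grouping rule (a new group starts whenever treat rises from 0 to 1) and replaces first() with ave(). The names grp and first_state are introduced here for illustration.

```r
testdat <- data.frame(
  id     = rep(c(1, 2, 3), each = 5),
  period = rep(1:5, 3),
  treat  = c(0,1,1,1,0, 0,0,1,1,1, 0,0,1,1,1),
  state  = c(0,0,0,0,0, 0,1,1,1,1, 0,0,0,1,1),
  int    = c(rep(0, 13), 1, 1)
)

# Group id: increments each time treat rises from 0 to 1
grp <- cumsum(c(FALSE, diff(testdat$treat) > 0))

# First value of state within each group, repeated along the group
first_state <- ave(testdat$state, grp, FUN = function(s) s[1])

# 1 only where state is on, the group started with state off, and treat is on
testdat$int2 <- as.integer(testdat$state > 0 & first_state == 0 & testdat$treat > 0)

all(testdat$int2 == testdat$int)  # TRUE on this test data
```

Whether this is fast enough on a million rows is untested here; ave() splits the vector by group, so it is worth timing on the real data.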

An alternative grouping uses run-length encoding and gives effectively the same result (as suggested in https://stackoverflow.com/a/35313426):

testdat %>%
  group_by(grp = { yy <- rle(treat); rep(seq_along(yy$lengths), yy$lengths); }) %>%
  # ...

And as in that answer, I wish dplyr had an equivalent to data.table's rleid. The idea is to group consecutive runs of the same value in a column, rather than pooling every row that shares that value. If you look at the result mid-pipe (before dropping grp), you'd see:

testdat %>%
  group_by(grp = { yy <- rle(treat); rep(seq_along(yy$lengths), yy$lengths); }) %>%
  mutate(int2 = +(state > 0 & first(state) == 0 & treat > 0)) %>%
  ungroup()
# # A tibble: 15 x 7
#       id period treat state   int   grp  int2
#    <dbl>  <int> <dbl> <dbl> <dbl> <int> <int>
#  1     1      1     0     0     0     1     0
#  2     1      2     1     0     0     2     0
#  3     1      3     1     0     0     2     0
#  4     1      4     1     0     0     2     0
#  5     1      5     0     0     0     3     0
#  6     2      1     0     0     0     3     0
#  7     2      2     0     1     0     3     0
#  8     2      3     1     1     0     4     0
#  9     2      4     1     1     0     4     0
# 10     2      5     1     1     0     4     0
# 11     3      1     0     0     0     5     0
# 12     3      2     0     0     0     5     0
# 13     3      3     1     0     0     6     0
# 14     3      4     1     1     1     6     1
# 15     3      5     1     1     1     6     1

But that's just wishful thinking. In the meantime, a small helper function gives much the same convenience:

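# Run-length id: 1 for the first run of equal values, 2 for the next, and so on
my_rleid <- function(x) { yy <- rle(x); rep(seq_along(yy$lengths), yy$lengths); }
testdat %>%
  group_by(grp = my_rleid(treat)) %>%
  mutate(int2 = +(state > 0 & first(state) == 0 & treat > 0)) %>%
  ungroup() %>%
  select(-grp)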
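One caveat with the groupings above: they follow runs of treat only, so a run can cross from one id into the next (in the mid-pipe output, grp 3 covers row 5 of id 1 and rows 6-7 of id 2). If runs should restart with each id, which is an assumption about the intended panel structure rather than something stated in the question, the helper can be fed the (id, treat) pair instead. The sketch below stays in base R; grp_treat, grp_id and first_state are names introduced here. As far as I know, data.table's rleid() accepts several columns for the same purpose, and dplyr 1.1.0 and later provide consecutive_id(), but neither is used here.

```r
testdat <- data.frame(
  id     = rep(c(1, 2, 3), each = 5),
  period = rep(1:5, 3),
  treat  = c(0,1,1,1,0, 0,0,1,1,1, 0,0,1,1,1),
  state  = c(0,0,0,0,0, 0,1,1,1,1, 0,0,0,1,1),
  int    = c(rep(0, 13), 1, 1)
)

my_rleid <- function(x) { yy <- rle(x); rep(seq_along(yy$lengths), yy$lengths) }

# Runs of treat alone: row 5 (id 1) and rows 6-7 (id 2) share run 3
grp_treat <- my_rleid(testdat$treat)

# Runs of the (id, treat) pair: a change of id always starts a new run
grp_id <- my_rleid(paste(testdat$id, testdat$treat))

# Same rule as before, applied within the id-aware runs
first_state <- ave(testdat$state, grp_id, FUN = function(s) s[1])
testdat$int2 <- as.integer(testdat$state > 0 & first_state == 0 & testdat$treat > 0)

all(testdat$int2 == testdat$int)  # TRUE on this test data
```

On this test data both groupings give the same int2; they would differ only where a treat run continues unbroken across an id boundary.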
r2evans answered May 11 '26 00:05
