How to flag first change in a variable value between years, per group?

Question

Given a very large longitudinal dataset with different groups, I need to create a flag that indicates the first change in a certain variable (code) between years (year), per group (id). The type of observation within the same id-year just indicates different group members.

Sample data:

library(tidyverse)
sample <- tibble(id = rep(1:3, each = 6),
                 year = rep(2010:2012, 3, each = 2),
                 type = rep(1:2, 9),
                 code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","","klm","nop","nop"))

What I need is to flag the first change in code within a group, between years. Later changes do not matter. Missing codes ("") can be treated as NA, but in any case they should not affect the flag. The following is the above tibble with a flag column as it should be:

# A tibble: 18 × 5
      id  year  type  code  flag
   <int> <int> <int> <chr> <dbl>
1      1  2010     1   abc     0
2      1  2010     2   abc     0
3      1  2011     1           0
4      1  2011     2           0
5      1  2012     1   xyz     1
6      1  2012     2   xyz     1
7      2  2010     1           0
8      2  2010     2           0
9      2  2011     1   lmn     0
10     2  2011     2           0
11     2  2012     1   efg     1
12     2  2012     2   efg     1
13     3  2010     1   def     0
14     3  2010     2   def     0
15     3  2011     1           1
16     3  2011     2   klm     1
17     3  2012     1   nop     1
18     3  2012     2   nop     1

I still have a looping mindset and I am trying to use vectorized dplyr to do what I need. Any input would be greatly appreciated!

EDIT: thanks for pointing out the importance of year. The rows within each id are arranged by year, as the ordering matters here, and all types within an id-year need to have the same flag. So, in the edited row 15 the code is "", which would not count as a change by itself, but since row 16 in the same year has a new code, both observations need to have their flags set to 1.

akrun · Accepted Answer

We can use data.table

library(data.table)
setDT(sample)[, flag := 0][code != "", flag := {
  rl <- rleid(code) - 1
  cummax(rl * (rl < 2))
}, by = id]
sample
#    id year type code flag
# 1:  1 2010    1  abc    0
# 2:  1 2010    2  abc    0
# 3:  1 2011    1         0
# 4:  1 2011    2         0
# 5:  1 2012    1  xyz    1
# 6:  1 2012    2  xyz    1
# 7:  2 2010    1         0
# 8:  2 2010    2         0
# 9:  2 2011    1  lmn    0
#10:  2 2011    2         0
#11:  2 2012    1  efg    1
#12:  2 2012    2  efg    1
#13:  3 2010    1  def    0
#14:  3 2010    2  def    0
#15:  3 2011    1  klm    1
#16:  3 2011    2  klm    1
#17:  3 2012    1  nop    1
#18:  3 2012    2  nop    1
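
To see why the expression inside the braces works, it may help to look at the intermediate values for a single group. The sketch below uses the non-empty codes of id 3 from the question's sample data; the variable names first and flag are only illustrative.

```r
library(data.table)

# Non-empty codes for id 3 in the sample data, already in year order
code <- c("def", "def", "klm", "nop", "nop")

rl    <- rleid(code) - 1   # run index starting at 0:        0 0 1 2 2
first <- rl * (rl < 2)     # keep only the first change:     0 0 1 0 0
flag  <- cummax(first)     # carry the 1 forward in time:    0 0 1 1 1
```

Multiplying by (rl < 2) zeroes out every run after the first change, and cummax then keeps the flag at 1 for all later rows.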

Update

If we need to include the 'year' as well,

setDT(sample)[, flag := 0][code != "", flag := {
  rl <- rleid(code, year) - 1
  cummax(rl * (rl < 2))
}, by = id]
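
The EDIT in the question asks that every row of an id-year share the same flag, including rows whose code is "". Because the filter code != "" leaves those rows at 0, one possible follow-up step is to take the maximum flag within each id-year. This is a sketch under that assumption rather than part of the original answer. It keeps rleid(code) from the first version, since rleid(code, year) starts a new run whenever the year changes, even if the code stays the same.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
sample <- data.table(
  id   = rep(1:3, each = 6),
  year = rep(2010:2012, 3, each = 2),
  type = rep(1:2, 9),
  code = c("abc", "abc", "", "", "xyz", "xyz",
           "", "", "lmn", "", "efg", "efg",
           "def", "def", "", "klm", "nop", "nop")
)

# Flag the first change among non-empty codes, per id (as in the answer)
sample[, flag := 0][code != "", flag := {
  rl <- rleid(code) - 1
  cummax(rl * (rl < 2))
}, by = id]

# Assumed extra step: share the flag across all rows of the same id-year,
# so an empty code in a year that contains a change is flagged too
sample[, flag := max(flag), by = .(id, year)]
```

On the sample data this sets row 15 (empty code, id 3, 2011) to 1 because row 16 in the same year carries the new code. A year that has only empty codes after the first change would still be left at 0; whether that is desired is not stated in the question.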

Adam Spannbauer · Answer

A possible solution using dplyr, though I'm not sure it's the cleanest way:

sample %>% 
  group_by(id) %>% 
  #find first year per group where code exists
  mutate(first_year = min(year[code != ""])) %>% 
  #gather all codes from first year (does not assume code is constant within year)
  mutate(first_codes = list(code[year==first_year])) %>% 
  #if year is not first year & code not in first year codes & code not blank
  mutate(flag = as.numeric(year != first_year & !(code %in% unlist(first_codes)) & code != "")) %>% 
  #drop created columns
  select(-first_year, -first_codes) %>% 
  ungroup()

output

# A tibble: 18 × 5
      id  year  type  code  flag
   <int> <int> <int> <chr> <dbl>
1      1  2010     1   abc     0
2      1  2010     2   abc     0
3      1  2011     1           0
4      1  2011     2           0
5      1  2012     1   xyz     1
6      1  2012     2   xyz     1
7      2  2010     1           0
8      2  2010     2           0
9      2  2011     1   lmn     0
10     2  2011     2           0
11     2  2012     1   efg     1
12     2  2012     2   efg     1
13     3  2010     1   def     0
14     3  2010     2   def     0
15     3  2011     1   klm     1
16     3  2011     2   klm     1
17     3  2012     1   nop     1
18     3  2012     2   nop     1
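
The pipeline above flags individual rows, so a row with an empty code stays at 0 even when another row in the same id-year has changed. A possible variation, sketched below, finds the first year per id in which any non-empty code differs from the codes of the first coded year, and then flags every row from that year onward. The helper columns has_code, differs and change_year are illustrative names, and treating "flag everything from the change year onward" as the desired behaviour is an assumption based on the expected output in the question.

```r
library(dplyr)
library(tibble)

sample <- tibble(
  id   = rep(1:3, each = 6),
  year = rep(2010:2012, 3, each = 2),
  type = rep(1:2, 9),
  code = c("abc", "abc", "", "", "xyz", "xyz",
           "", "", "lmn", "", "efg", "efg",
           "def", "def", "", "klm", "nop", "nop")
)

flagged <- sample %>%
  group_by(id) %>%
  mutate(
    has_code    = code != "",
    # first year with a non-empty code; Inf if the group has none
    first_year  = if (any(has_code)) min(year[has_code]) else Inf,
    # a row differs if it has a code that was not seen in that first year
    differs     = has_code & !(code %in% code[year == first_year]),
    # first year in which any row differs; Inf if the code never changes
    change_year = if (any(differs)) min(year[differs]) else Inf,
    # flag every row of the group from the change year onward
    flag        = as.numeric(year >= change_year)
  ) %>%
  ungroup() %>%
  select(id, year, type, code, flag)
```

The explicit if/else around min() avoids the warning that min() raises on an empty vector when a group has no codes or never changes.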
