How to flag first change in a variable value between years, per group?

Tags: loops, r, dplyr

Given a very large longitudinal dataset with different groups, I need to create a flag that indicates the first change in a certain variable (code) between years (year), per group (id). The type of observation within the same id-year just indicates different group members.

Sample data:

library(tidyverse)
sample <- tibble(id = rep(1:3, each = 6),
                 year = rep(2010:2012, 3, each = 2),
                 type = rep(1:2, 9),
                 code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","","klm","nop","nop"))

What I need is to flag the first change to code within a group, between years. Later changes do not matter. Missing codes ("") can be treated as NA, but in any case they should not affect the flag. The following is the above tibble with the flag column as it should be:

# A tibble: 18 × 5
      id  year  type  code  flag
   <int> <int> <int> <chr> <dbl>
1      1  2010     1   abc     0
2      1  2010     2   abc     0
3      1  2011     1           0
4      1  2011     2           0
5      1  2012     1   xyz     1
6      1  2012     2   xyz     1
7      2  2010     1           0
8      2  2010     2           0
9      2  2011     1   lmn     0
10     2  2011     2           0
11     2  2012     1   efg     1
12     2  2012     2   efg     1
13     3  2010     1   def     0
14     3  2010     2   def     0
15     3  2011     1           1
16     3  2011     2   klm     1
17     3  2012     1   nop     1
18     3  2012     2   nop     1

I still have a looping mindset and I am trying to use vectorized dplyr to do what I need. Any input would be greatly appreciated!

EDIT: thanks for pointing out the importance of year. Each id is arranged by year, since the ordering matters here, and all types within the same id and year need to have the same flag. So, in the edited row 15 the code is "", which would not warrant a change by itself, but since row 16 in the same year has a new code, both observations need to have their flags set to 1.

Yuval Spiegler asked Mar 09 '23 03:03


2 Answers

We can use data.table:

library(data.table)
# flag starts at 0; only rows with a non-blank code are updated, per id
setDT(sample)[, flag := 0][code != "", flag := {
  rl <- rleid(code) - 1; cummax(rl * (rl < 2)) }, by = id]
sample
#    id year type code flag
# 1:  1 2010    1  abc    0
# 2:  1 2010    2  abc    0
# 3:  1 2011    1         0
# 4:  1 2011    2         0
# 5:  1 2012    1  xyz    1
# 6:  1 2012    2  xyz    1
# 7:  2 2010    1         0
# 8:  2 2010    2         0
# 9:  2 2011    1  lmn    0
#10:  2 2011    2         0
#11:  2 2012    1  efg    1
#12:  2 2012    2  efg    1
#13:  3 2010    1  def    0
#14:  3 2010    2  def    0
#15:  3 2011    1  klm    1
#16:  3 2011    2  klm    1
#17:  3 2012    1  nop    1
#18:  3 2012    2  nop    1
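
To see what the `rleid`/`cummax` expression does, here is a small sketch on a plain vector. The `codes` vector is just the non-blank codes of id 3 from the edited sample, and the variable names are mine, not part of the answer.

```r
library(data.table)

# Non-blank codes of id 3 in the question's (edited) sample data
codes <- c("def", "def", "klm", "nop", "nop")

# rleid() numbers consecutive runs of equal values: 1 1 2 3 3
rl <- rleid(codes) - 1            # 0 0 1 2 2 (0 = still on the first code)

# Keep only the first change: runs numbered 2 or higher are zeroed out
first_change <- rl * (rl < 2)     # 0 0 1 0 0

# cummax() carries the 1 forward, so later changes cannot reset the flag
flag <- cummax(first_change)      # 0 0 1 1 1
```

Since `rl` starts at 0 and only ever grows in steps of 1, `pmin(rl, 1)` should give the same vector here.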

Update

If we need to include the 'year' as well,

setDT(sample)[, flag := 0][code != "", flag := {
  rl <- rleid(code, year) - 1
  cummax(rl * (rl < 2)) }, by = id]
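
Both calls above only update rows where `code != ""`, so a blank row (such as row 15 in the edited sample) keeps its initial 0. Note also that `rleid(code, year)` starts a new run whenever either column changes, so the same code repeated in a later year would count as a change. If every row in the same id and year should share the flag, as the question's edit asks, one possible follow-up is to take the maximum flag within each id/year. This is only a sketch under that assumption: the table is named `dt` here (my naming, to avoid masking `base::sample`) and it uses `rleid(code)` rather than `rleid(code, year)`.

```r
library(data.table)

# Sample data from the question, rebuilt as a data.table called `dt` (my naming)
dt <- data.table(
  id   = rep(1:3, each = 6),
  year = rep(2010:2012, times = 3, each = 2),
  type = rep(1:2, times = 9),
  code = c("abc", "abc", "", "", "xyz", "xyz",
           "", "", "lmn", "", "efg", "efg",
           "def", "def", "", "klm", "nop", "nop")
)

# Step 1: same idea as above, computed on non-blank rows only
dt[, flag := 0]
dt[code != "", flag := {
  rl <- rleid(code) - 1     # 0 for the first code, 1 after the first change, ...
  cummax(rl * (rl < 2))     # 1 from the first change onward
}, by = id]

# Step 2: share the flag across all rows of the same id and year,
# so a blank row next to a changed code is flagged too
dt[, flag := max(flag), by = .(id, year)]

dt$flag
# 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 1
```

One caveat: a later year in which every code is blank would still get 0 with this sketch, and two different codes within the first year would count as a change. Whether that is acceptable depends on the real data.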

akrun answered May 03 '23 13:05


A possible solution using dplyr. Not sure it's the cleanest way, though:

sample %>% 
  group_by(id) %>% 
  #find first year per group where code exists
  mutate(first_year = min(year[code != ""])) %>% 
  #gather all codes from first year (does not assume code is constant within year)
  mutate(first_codes = list(code[year==first_year])) %>% 
  #if year is not first year & code not in first year codes & code not blank
  mutate(flag = as.numeric(year != first_year & !(code %in% unlist(first_codes)) & code != "")) %>% 
  #drop created columns
  select(-first_year, -first_codes) %>% 
  ungroup()

Output:

# A tibble: 18 × 5
      id  year  type  code  flag
   <int> <int> <int> <chr> <dbl>
1      1  2010     1   abc     0
2      1  2010     2   abc     0
3      1  2011     1           0
4      1  2011     2           0
5      1  2012     1   xyz     1
6      1  2012     2   xyz     1
7      2  2010     1           0
8      2  2010     2           0
9      2  2011     1   lmn     0
10     2  2011     2           0
11     2  2012     1   efg     1
12     2  2012     2   efg     1
13     3  2010     1   def     0
14     3  2010     2   def     0
15     3  2011     1   klm     1
16     3  2011     2   klm     1
17     3  2012     1   nop     1
18     3  2012     2   nop     1
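
The pipeline above leaves a blank row at 0 even when another row in the same id and year has a new code, and a return to the original code after a change would drop back to 0. A possible variation that handles both is sketched below. It assumes "first change" means the first non-blank code that differs from the group's first non-blank code. The helper columns `first_code`, `changed` and `any_change` and the name `sample_df` are mine, chosen for illustration.

```r
library(dplyr)

# Sample data from the question (called `sample_df` here, my naming,
# so that base::sample is not masked)
sample_df <- tibble::tibble(
  id   = rep(1:3, each = 6),
  year = rep(2010:2012, times = 3, each = 2),
  type = rep(1:2, times = 9),
  code = c("abc", "abc", "", "", "xyz", "xyz",
           "", "", "lmn", "", "efg", "efg",
           "def", "def", "", "klm", "nop", "nop")
)

result <- sample_df %>%
  arrange(id, year) %>%                      # the cumulative step relies on year order
  group_by(id) %>%
  mutate(
    first_code = first(code[code != ""]),    # first non-blank code in the group
    changed    = code != "" & code != first_code
  ) %>%
  group_by(id, year) %>%
  mutate(any_change = any(changed)) %>%      # one changed row marks the whole id-year
  group_by(id) %>%
  mutate(flag = as.numeric(cumany(any_change))) %>%  # stays 1 once a change was seen
  ungroup() %>%
  select(-first_code, -changed, -any_change)

result$flag
# 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 1
```

`cumany()` is dplyr's cumulative version of `any()`, so the flag stays at 1 for all later years of the same id, including years where every code is blank.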

Adam Spannbauer answered May 03 '23 14:05