Given a very large longitudinal dataset with different groups, I need to create a flag that indicates the first change in a certain variable (code
) between years (year
), per group (id
). The type
of observation within the same id-year just indicates different group members.
Sample data:
library(tidyverse)
sample <- tibble(id = rep(1:3, each=6),
year = rep(2010:2012, 3, each=2),
type = (rep(1:2, 9)),
code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","","klm","nop","nop"))
What I need is to flag the first change to code
within a group, between years. Second changes do not matter. Missing codes (""
) can be treated as NA
but in any case should not affect flag
. The following is the above tibble with a flag field as it should be:
# A tibble: 18 × 5
id year type code flag
<int> <int> <int> <chr> <dbl>
1 1 2010 1 abc 0
2 1 2010 2 abc 0
3 1 2011 1 0
4 1 2011 2 0
5 1 2012 1 xyz 1
6 1 2012 2 xyz 1
7 2 2010 1 0
8 2 2010 2 0
9 2 2011 1 lmn 0
10 2 2011 2 0
11 2 2012 1 efg 1
12 2 2012 2 efg 1
13 3 2010 1 def 0
14 3 2010 2 def 0
15 3 2011 1 1
16 3 2011 2 klm 1
17 3 2012 1 nop 1
18 3 2012 2 nop 1
I still have a looping mindset and I am trying to use vectorized dplyr to do what I need. Any input would be greatly appreciated!
EDIT: thanks for pointing this out regarding the importance of year
. The id's are arranged by year, as the ordering is important here, and also all types
per id
per year
need to have the same flag. So, in the edited row 15 e code is ""
which would not warrant a change by itself, but since in the same year row 16 has a new code
, both observations need to have their codes changed to 1.
We can use data.table
library(data.table)
setDT(sample)[, flag :=0][code!="", flag := {rl <- rleid(code)-1; cummax(rl*(rl < 2)) }, id]
sample
# id year type code flag
# 1: 1 2010 1 abc 0
# 2: 1 2010 2 abc 0
# 3: 1 2011 1 0
# 4: 1 2011 2 0
# 5: 1 2012 1 xyz 1
# 6: 1 2012 2 xyz 1
# 7: 2 2010 1 0
# 8: 2 2010 2 0
# 9: 2 2011 1 lmn 0
#10: 2 2011 2 0
#11: 2 2012 1 efg 1
#12: 2 2012 2 efg 1
#13: 3 2010 1 def 0
#14: 3 2010 2 def 0
#15: 3 2011 1 klm 1
#16: 3 2011 2 klm 1
#17: 3 2012 1 nop 1
#18: 3 2012 2 nop 1
If we need to include the 'year' as well,
setDT(sample)[, flag :=0][code!="", flag := {rl <- rleid(code, year)-1
cummax(rl*(rl < 2)) }, id]
possible solution using the dplyr
. not sure its the cleanest way though
sample %>%
group_by(id) %>%
#find first year per group where code exists
mutate(first_year = min(year[code != ""])) %>%
#gather all codes from first year (does not assume code is constant within year)
mutate(first_codes = list(code[year==first_year])) %>%
#if year is not first year & code not in first year codes & code not blank
mutate(flag = as.numeric(year != first_year & !(code %in% unlist(first_codes)) & code != "")) %>%
#drop created columns
select(-first_year, -first_codes) %>%
ungroup()
output
# A tibble: 18 × 5
id year type code flag
<int> <int> <int> <chr> <dbl>
1 1 2010 1 abc 0
2 1 2010 2 abc 0
3 1 2011 1 0
4 1 2011 2 0
5 1 2012 1 xyz 1
6 1 2012 2 xyz 1
7 2 2010 1 0
8 2 2010 2 0
9 2 2011 1 lmn 0
10 2 2011 2 0
11 2 2012 1 efg 1
12 2 2012 2 efg 1
13 3 2010 1 def 0
14 3 2010 2 def 0
15 3 2011 1 klm 1
16 3 2011 2 klm 1
17 3 2012 1 nop 1
18 3 2012 2 nop 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With