I'm trying to subset the data so it only preserves the first occurrence of a variable. I'm looking at a panel data that traces the career of workers, and I'm trying to subset the data so that it only shows until each person became Boss.
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
1 1994 Bon Manager 0
2 1990 Jane Manager 0
2 1991 Jane Boss 1
2 1992 Jane Manager 0
2 1993 Jane Boss 1
So I would want the data to look like:
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
2 1990 Jane Manager 0
2 1991 Jane Boss 1
This seems like basic censoring but for the sake of my analysis this is crucial..! Any help would be appreciated.
Here's a dplyr solution that uses two useful window functions lag()
and cumall()
:
df <- read.table(header = TRUE, text = "
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
1 1994 Bon Manager 0
2 1990 Jane Manager 0
2 1991 Jane Boss 1
2 1992 Jane Manager 0
2 1993 Jane Boss 1
", stringsAsFactors = FALSE)
library(dplyr)
# Use mutate to see the values of the new variables
df %>%
group_by(id) %>%
mutate(last_job = lag(job, default = ""), cumall(last_job != "Boss"))
# Use filter to see the results
df %>%
group_by(id) %>%
filter(cumall(lag(job, default = "") != "Boss"))
We use lag()
to figure out what job each person had in the previous year, and then use cumall()
to keep all rows up to the first instance of "Boss". If the data wasn't already sorted by year, you could use lag(job, order_by = year)
to make sure lag()
used the value of year, rather than the row order, to determine which was "last" year.
Base solution:
do.call(
rbind,
by(dat,dat$name,function(x) {
if ("Boss" %in% x$job) x[1:min(which(x$job=="Boss")),]
})
)
# id year name job job2
#Bon.1 1 1990 Bon Manager 0
#Bon.2 1 1991 Bon Manager 0
#Bon.3 1 1992 Bon Manager 0
#Bon.4 1 1993 Bon Boss 1
#Jane.6 2 1990 Jane Manager 0
#Jane.7 2 1991 Jane Boss 1
An alternative base solution:
dat$keep <- with(dat,
ave(job=="Boss",name,FUN=function(x) if(1 %in% x) cumsum(x) else 2)
)
with(dat, dat[keep==0 | (job=="Boss" & keep==1),] )
# id year name job job2 keep
#1 1 1990 Bon Manager 0 0
#2 1 1991 Bon Manager 0 0
#3 1 1992 Bon Manager 0 0
#4 1 1993 Bon Boss 1 1
#6 2 1990 Jane Manager 0 0
#7 2 1991 Jane Boss 1 1
And a data.table
solution:
dat <- as.data.table(dat)
dat[,if("Boss" %in% job) .SD[1:min(which(job=="Boss"))],by=name]
# name id year job job2
#1: Bon 1 1990 Manager 0
#2: Bon 1 1991 Manager 0
#3: Bon 1 1992 Manager 0
#4: Bon 1 1993 Boss 1
#5: Jane 2 1990 Manager 0
#6: Jane 2 1991 Boss 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With