subsetting data to first occurrence in R

Question

I'm trying to subset the data so it only preserves the first occurrence of a variable. I'm looking at a panel data that traces the career of workers, and I'm trying to subset the data so that it only shows until each person became Boss.

id  year    name    job    job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
1   1994    Bon     Manager 0
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1
2   1992    Jane    Manager 0
2   1993    Jane    Boss    1

So I would want the data to look like:

id  year    name    job   job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1

This seems like basic censoring but for the sake of my analysis this is crucial..! Any help would be appreciated.

hadley · Accepted Answer

Here's a dplyr solution that uses two useful window functions lag() and cumall():

df <- read.table(header = TRUE, text = "
id  year    name    job    job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
1   1994    Bon     Manager 0
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1
2   1992    Jane    Manager 0
2   1993    Jane    Boss    1
", stringsAsFactors = FALSE)

library(dplyr)

# Use mutate to see the values of the new variables
df %>% 
  group_by(id) %>%
  mutate(last_job = lag(job, default = ""), cumall(last_job != "Boss"))

# Use filter to see the results
df %>% 
  group_by(id) %>%
  filter(cumall(lag(job, default = "") != "Boss"))

We use lag() to figure out what job each person had in the previous year, and then use cumall() to keep all rows up to the first instance of "Boss". If the data wasn't already sorted by year, you could use lag(job, order_by = year) to make sure lag() used the value of year, rather than the row order, to determine which was "last" year.

thelatemail · Answer

Base solution:

do.call(
  rbind,
  by(dat,dat$name,function(x) {
    if ("Boss" %in% x$job) x[1:min(which(x$job=="Boss")),]
  })
)

#       id year name     job job2
#Bon.1   1 1990  Bon Manager    0
#Bon.2   1 1991  Bon Manager    0
#Bon.3   1 1992  Bon Manager    0
#Bon.4   1 1993  Bon    Boss    1
#Jane.6  2 1990 Jane Manager    0
#Jane.7  2 1991 Jane    Boss    1

An alternative base solution:

dat$keep <- with(dat, 
             ave(job=="Boss",name,FUN=function(x) if(1 %in% x) cumsum(x) else 2) 
            )
with(dat, dat[keep==0 | (job=="Boss" & keep==1),] )

#  id year name     job job2 keep
#1  1 1990  Bon Manager    0    0
#2  1 1991  Bon Manager    0    0
#3  1 1992  Bon Manager    0    0
#4  1 1993  Bon    Boss    1    1
#6  2 1990 Jane Manager    0    0
#7  2 1991 Jane    Boss    1    1

And a data.table solution:

dat <- as.data.table(dat)
dat[,if("Boss" %in% job) .SD[1:min(which(job=="Boss"))],by=name]

#   name id year     job job2
#1:  Bon  1 1990 Manager    0
#2:  Bon  1 1991 Manager    0
#3:  Bon  1 1992 Manager    0
#4:  Bon  1 1993    Boss    1
#5: Jane  2 1990 Manager    0
#6: Jane  2 1991    Boss    1

subsetting data to first occurrence in R

Tags:

r

song0089

2 Answers

hadley

thelatemail

Recent Activity

Donate For Us

subsetting data to first occurrence in R

Tags:

r

song0089

2 Answers

hadley

thelatemail

Related questions

Recent Activity

Donate For Us