subsetting data to first occurrence in R

Tags: r

I'm trying to subset my data so that, for each person, it keeps only the rows up to and including the first occurrence of a particular value. I have panel data that trace the careers of workers, and I want to keep each person's records only until they first became Boss.

id  year    name    job    job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
1   1994    Bon     Manager 0
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1
2   1992    Jane    Manager 0
2   1993    Jane    Boss    1

So I would want the data to look like:

id  year    name    job   job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1

This seems like basic censoring, but it is crucial for my analysis. Any help would be appreciated.

song0089 asked Feb 07 '14 05:02



2 Answers

Here's a dplyr solution that uses two useful window functions, lag() and cumall():

df <- read.table(header = TRUE, text = "
id  year    name    job    job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
1   1994    Bon     Manager 0
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1
2   1992    Jane    Manager 0
2   1993    Jane    Boss    1
", stringsAsFactors = FALSE)

library(dplyr)

# Use mutate to see the values of the new variables
df %>% 
  group_by(id) %>%
  mutate(last_job = lag(job, default = ""), keep = cumall(last_job != "Boss"))

# Use filter to see the results
df %>% 
  group_by(id) %>%
  filter(cumall(lag(job, default = "") != "Boss"))

We use lag() to find the job each person held in the previous row (here, the previous year), and then use cumall() to keep every row up to and including the first instance of "Boss". If the data weren't already sorted by year, you could use lag(job, order_by = year) so that lag() uses the value of year, rather than the row order, to determine which year was "last". Note that cumall() still accumulates in row order, so sorting first with arrange(id, year) is the safer option.
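
To make the mechanics concrete, here is a minimal base R sketch of what lag() and cumall() compute for a single person's job history. The two expressions are stand-ins written for illustration, not dplyr's actual implementation, and they assume the vector is sorted by year and contains no missing values.

```r
# One person's job history, sorted by year (Bon's rows from the question)
job <- c("Manager", "Manager", "Manager", "Boss", "Manager")

# Stand-in for lag(job, default = ""): shift every value down one position
last_job <- c("", head(job, -1))

# Stand-in for cumall(last_job != "Boss"): TRUE until the first FALSE,
# and FALSE from then on
keep <- cumsum(last_job == "Boss") == 0

kept_jobs <- job[keep]
print(kept_jobs)
```

The row in which the person becomes Boss is kept because the test looks at the previous row's job, so the condition first fails on the row after the promotion.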

hadley answered Nov 08 '22 23:11



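Base solution (dat is the question's data frame):

do.call(
  rbind,
  by(dat, dat$name, function(x) {
    # Keep rows 1 through the first "Boss" row; a group without any "Boss"
    # returns NULL and is dropped by rbind()
    if ("Boss" %in% x$job) x[1:min(which(x$job == "Boss")), ]
  })
)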

#       id year name     job job2
#Bon.1   1 1990  Bon Manager    0
#Bon.2   1 1991  Bon Manager    0
#Bon.3   1 1992  Bon Manager    0
#Bon.4   1 1993  Bon    Boss    1
#Jane.6  2 1990 Jane Manager    0
#Jane.7  2 1991 Jane    Boss    1

An alternative base solution:

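# keep = running count of "Boss" rows within each name;
# 2 marks people who never became Boss
dat$keep <- with(dat,
  ave(job == "Boss", name, FUN = function(x) if (1 %in% x) cumsum(x) else 2)
)
# Retain rows before the first "Boss" (keep == 0) plus the first "Boss" row itself
with(dat, dat[keep == 0 | (job == "Boss" & keep == 1), ])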

#  id year name     job job2 keep
#1  1 1990  Bon Manager    0    0
#2  1 1991  Bon Manager    0    0
#3  1 1992  Bon Manager    0    0
#4  1 1993  Bon    Boss    1    1
#6  2 1990 Jane Manager    0    0
#7  2 1991 Jane    Boss    1    1

And a data.table solution:

library(data.table)

dat <- as.data.table(dat)
# Per name: keep rows 1 through the first "Boss" row
dat[, if ("Boss" %in% job) .SD[1:min(which(job == "Boss"))], by = name]

#   name id year     job job2
#1:  Bon  1 1990 Manager    0
#2:  Bon  1 1991 Manager    0
#3:  Bon  1 1992 Manager    0
#4:  Bon  1 1993    Boss    1
#5: Jane  2 1990 Manager    0
#6: Jane  2 1991    Boss    1
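
One behavior to be aware of: all three snippets in this answer drop anyone who never became Boss, because their group returns NULL (or, in the ave() version, gets keep == 2). If such workers should be kept in full, a base R variant could look like the sketch below. The third worker, Kim, is a hypothetical addition for illustration, and the code assumes the rows are already sorted by year within each id.

```r
dat <- read.table(header = TRUE, text = "
id  year    name    job    job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
1   1994    Bon     Manager 0
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1
2   1992    Jane    Manager 0
2   1993    Jane    Boss    1
3   1990    Kim     Manager 0
3   1991    Kim     Manager 0
", stringsAsFactors = FALSE)  # Kim (id 3) is hypothetical: never becomes Boss

# For each row, count the "Boss" rows that occur strictly before it
# within the same id (running total minus the current row's own flag)
is_boss <- as.integer(dat$job == "Boss")
boss_before <- ave(is_boss, dat$id, FUN = function(b) cumsum(b) - b)

# Keep every row that has no earlier "Boss" row for that person
res <- dat[boss_before == 0, ]
print(res)
```

Because the filter only asks whether a Boss row came earlier, a person with no Boss rows passes it on every row and is retained unchanged.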
thelatemail answered Nov 08 '22 23:11
