I have an unbalanced panel dataset in R. The following will serve as an example:
dt <- data.frame(name = rep(c("A", "B", "C"), c(3, 2, 3)),
                 year = c(2001:2003, 2000, 2002, 2000:2001, 2003))
> dt
  name year
1    A 2001
2    A 2002
3    A 2003
4    B 2000
5    B 2002
6    C 2000
7    C 2001
8    C 2003
Now, I need at least 2 observations in consecutive years for each name. Hence, I would like to remove rows 4, 5, and 8. How do I best do that in R?
EDIT:
Thanks to the comment below, I can make this a bit clearer. If I had an extra observation (row 9) with name = C and year = 2004, I would want to keep both rows 8 and 9 along with the others.
My (hackish) way to do it would be:
is.consecutive <- duplicated(rbind(dt,
                                   transform(dt, year = year + 1),
                                   transform(dt, year = year - 1)),
                             fromLast = TRUE)[1:nrow(dt)]
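is.consecutive is a logical vector flagging the observations to retain. It works because a row has a neighbouring year exactly when it reappears in a copy of the data shifted by +1 or -1 year, and duplicated(..., fromLast = TRUE) picks up those matches. For your example, the vector is: TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE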
Finally, you can easily use this vector to subset your data.frame, e.g. with:
dt[is.consecutive,]
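The same rule can also be spelled out per name, which may be easier to read than the duplicated() trick. This is only a sketch in base R: the helper has_neighbour and the objects keep and dt2 are names made up for this illustration, and it assumes no duplicated name/year rows.

```r
dt <- data.frame(name = rep(c("A", "B", "C"), c(3, 2, 3)),
                 year = c(2001:2003, 2000, 2002, 2000:2001, 2003))

# Hypothetical helper: a year is kept if the previous or the next year
# also occurs within the same group of years.
has_neighbour <- function(y) (y - 1) %in% y | (y + 1) %in% y

# ave() applies the helper within each name; it returns 0/1 here,
# so convert back to logical before subsetting.
keep <- as.logical(ave(dt$year, dt$name, FUN = has_neighbour))
dt[keep, ]

# The case from the question's EDIT: adding C/2004 keeps rows 8 and 9.
dt2 <- rbind(dt, data.frame(name = "C", year = 2004))
keep2 <- as.logical(ave(dt2$year, dt2$name, FUN = has_neighbour))
dt2[keep2, ]
```

For the original data, keep matches the is.consecutive vector above; with the extra C/2004 row, only the two B rows are dropped.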
Here's a more (far too...?) convoluted alternative, where you can set the minimum length of runs of consecutive observations.
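# sort so that diff() compares adjacent years within each name
dt <- dt[order(dt$name, dt$year), ]
# minimum run length to keep
rl <- 2
do.call(rbind,
        by(dt, dt$name, function(x){
          # start a new run id whenever the gap to the previous year exceeds 1
          run <- c(0, cumsum(diff(x$year) > 1))
          # keep rows that belong to a run of at least rl observations
          x[ave(run, run, FUN = length) >= rl, ]
        })
)
#     name year
# A.1    A 2001
# A.2    A 2002
# A.3    A 2003
# C.6    C 2000
# C.7    C 2001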
rl <- 3
do.call(rbind,
        by(dt, dt$name, function(x){
          run <- c(0, cumsum(diff(x$year) > 1))
          x[ave(run, run, FUN = length) >= rl, ]
        })
)
#     name year
# A.1    A 2001
# A.2    A 2002
# A.3    A 2003
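If the filter is needed more than once, the run-length idea can be wrapped in a small function. The following is a rough sketch rather than a tested utility: keep_runs and its min_len argument are hypothetical names, the columns are assumed to be called name and year as in the example, and the data frame is assumed to have at least one row. It avoids by()/rbind by starting a new run whenever the name changes or the gap between years exceeds 1.

```r
dt <- data.frame(name = rep(c("A", "B", "C"), c(3, 2, 3)),
                 year = c(2001:2003, 2000, 2002, 2000:2001, 2003))

# Hypothetical helper: keep rows belonging to runs of consecutive years
# that are at least min_len observations long.
keep_runs <- function(df, min_len = 2) {
  df <- df[order(df$name, df$year), ]
  n <- nrow(df)
  # A new run starts at the first row, when the name changes,
  # or when the gap to the previous year is larger than 1.
  new_run <- c(TRUE, df$name[-1] != df$name[-n] | diff(df$year) > 1)
  run_id <- cumsum(new_run)
  # ave() returns the size of each row's run, aligned with the rows.
  df[ave(run_id, run_id, FUN = length) >= min_len, ]
}

res2 <- keep_runs(dt, min_len = 2)  # A 2001-2003 and C 2000-2001
res3 <- keep_runs(dt, min_len = 3)  # only A 2001-2003
```

Unlike the do.call(rbind, by(...)) version, this keeps the original row names instead of prefixing them with the group name.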