I'm trying to find runs of consecutive years in a data frame (ideally using plyr).
I'd like to get from this:
require(plyr)
dat <- data.frame(
  name = c(rep("A", 11), rep("B", 11)),
  year = c(2000:2010, 2000:2005, 2007:2011)
)
To this:
out <- data.frame(
  name = c("A", "B", "B"),
  range = c("2000-2010", "2000-2005", "2007-2011"))
It's easy enough to identify whether each group has a continuous run of years:
ddply(dat, .(name), summarise,
      continuous = (max(year) - min(year)) + 1 == length(year))
How do I go about breaking down group "B" into two ranges?
Any ideas or strategies would be really appreciated.
Thanks
Whether you use a function from "plyr" or from base R, you first need to establish some groups. Since your years are sequential, one way to detect where a new run starts is to look for where diff(year) is not equal to 1. diff returns a vector one element shorter than its input, so we prepend a 1 and take the cumsum of the result, which gives every run its own id.
Putting that mouthful of an explanation into practice, you can try something like this:
dat$id2 <- cumsum(c(1, diff(dat$year) != 1))
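To see what that line does step by step, here is a small sketch on group B's years alone. The variable names (year, breaks, run_id) are only for illustration and are not part of the original answer.

```r
# Group B's years from the question: one gap between 2005 and 2007
year <- c(2000:2005, 2007:2011)

# TRUE wherever the step to the next year is not exactly 1
breaks <- diff(year) != 1

# Prepend 1 so the result has the same length as `year`;
# every TRUE bumps the running total, starting a new run id
run_id <- cumsum(c(1, breaks))

print(data.frame(year, run_id))
```

The first six years get run id 1 and the remaining five get run id 2.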
From here, you can use aggregate or your favorite grouping function to get the output you're looking for.
aggregate(year ~ name + id2, dat, function(x) paste(min(x), max(x), sep = "-"))
#   name id2      year
# 1    A   1 2000-2010
# 2    B   2 2000-2005
# 3    B   3 2007-2011
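Since the question asked for plyr, the same id column can be handed to ddply instead of aggregate. This is a sketch, assuming plyr is installed. It rebuilds the question's data so that it runs on its own, and the output column name range is taken from the desired output in the question.

```r
library(plyr)

# Data from the question
dat <- data.frame(
  name = c(rep("A", 11), rep("B", 11)),
  year = c(2000:2010, 2000:2005, 2007:2011)
)

# Run id as above: a new id starts wherever the year does not advance by 1
dat$id2 <- cumsum(c(1, diff(dat$year) != 1))

# One row per (name, id2) combination with the first and last year pasted together
out <- ddply(dat, .(name, id2), summarise,
             range = paste(min(year), max(year), sep = "-"))
print(out)
```

If the helper column is not wanted in the result, it can be dropped afterwards with out$id2 <- NULL.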
To use range with aggregate, you need to change sep to collapse, because range(x) returns a single vector of length two rather than two separate arguments, as below:
aggregate(year ~ name + id2, dat, function(x) paste(range(x), collapse = "-"))
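One caveat with computing the id over the whole year column is that it relies on the year "jumping" at the boundary between names. If one name happened to end in, say, 1999 and the next began in 2000, the two would be merged into one run. A more defensive sketch, assuming base R's ave is acceptable, computes the run id within each name instead. The column name run is just illustrative.

```r
# Data from the question
dat <- data.frame(
  name = c(rep("A", 11), rep("B", 11)),
  year = c(2000:2010, 2000:2005, 2007:2011)
)

# ave() applies the function separately to the years of each name,
# so run ids restart at 1 for every name
dat$run <- ave(dat$year, dat$name,
               FUN = function(y) cumsum(c(1, diff(y) != 1)))

# Group on both name and run, then collapse each run's range into a string
out <- aggregate(year ~ name + run, dat,
                 function(x) paste(range(x), collapse = "-"))
print(out)
```

This assumes the years are sorted within each name. If they might not be, sort first with dat <- dat[order(dat$name, dat$year), ].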